Introduction

This technical report covers research carried out by the
Secure Distributed Processing Systems group at UCLA under ARPA
Contract MDA-903-77-C-0211 during the last two quarters of 1978.
Progress has been made on all four contracted tasks, namely
network security, data management security, high availability
secure information management, and UCLA secure system
enhancements. Below, we describe that progress.


Task  I  -  Network  Security 

As part of UCLA's participation in the larger ARPA-sponsored
network security experiment, which employs BCR units to demonstrate
that end-to-end encryption of individual connections on the
ARPANET is feasible, a BCR unit has been successfully installed
and checked out at UCLA with the help of Collins Radio personnel.

Additional work was also done to understand the limitations
of public key based encryption algorithms, especially in their
application to digital signatures. After the inherent
limitations were understood, a variety of methods were developed,
using either public key or conventional algorithms, which are
more robust than the unadorned public key approach.


Task  II  -  Data  Management  Security 

The kernel implementation for the Ingres database system,
which was begun in previous quarters, was successfully completed.
The resulting prototype kernel provides secure operation for the
normal functions of retrieval and update. Performance of the
kernel based system is excellent, comparable to that of the
unaltered Ingres system itself. This result makes it quite
clear that kernel based secure data management systems are
feasible. Additional work is needed, however, to demonstrate how
various maintenance functions can be supported under the security
constraints.


Task  III  -  High  Availability  Secure  Information  Management 

Significant results were obtained during this contract
period. Protocols for reliable system operation, which account
for fully distributed system behavior, crashes of components,
partitions of the underlying network, and various other problems,
were developed and documented. A reasonably efficient method of
coordinating the recovery of a large set of cooperating processes
in a distributed or centralized environment was developed. The
recovery method permits ongoing activity of arbitrary members of
the cooperating group. In addition, a suitable system
architecture in which this facility can operate was designed. It
preserves kernel based concepts so that the overall security of
the system would be easier to demonstrate. All of this work is
described in further detail later in this technical report.

During this period, we also analyzed the cost of enforcing
integrity constraints in data bases by constructing analytic
models for each of the possible approaches. We were able to show
that run time enforcement, under reasonable assumptions, is
superior to all of the other proposed methods, including
enforcement at compile time. This work is reported at the 1979
ACM SIGMOD Conference.


Task IV - UCLA Secure System Enhancement

A great deal of progress was made on the verification of the
UCLA Secure Unix kernel during this period. A variant of the
Alphard methods proposed by researchers at Carnegie Mellon
University was developed for use, and successfully applied in
portions of the proof effort. As a result of the substantial
progress, both in developing an overall proof method and in
its application, changes were made both in the system
implementation and in the abstract model which had previously been
published.


The  body  of  this  technical  report  consists  of  a  presentation 
of  the  substantial  results  in  reliable  distributed  systems 
mentioned  earlier  under  Task  III. 
ing. Recovery is done in such a way that maximum forward
progress is achieved. The proposed protocol supports the
integration of virtually any locking discipline, including
predicate locking methods. A cost and delay analysis of the
protocol reveals that its performance does not depend on the
size of the network for most topologies of interest. The
protocol is formally described using state transition diagrams,
and a proof of its correctness is included in this work. A
proposal for an extension aimed at optimizing operation of the
protocol to adapt to highly skewed distributions of activity is
also presented.

Two protocols for the detection of deadlocks in distributed
databases - a hierarchically organized one and a distributed
one - are introduced in this dissertation. A graph model which
depicts the state of execution of all transactions in the system
is used by both protocols. A cycle in this graph is a necessary
and sufficient condition for a deadlock to exist. Nevertheless,
neither protocol requires that the global graph be built and
maintained in order for deadlocks to be detected. In the case of
the hierarchical protocol, the communications cost can be
optimized if the topology of the hierarchy is appropriately
chosen.

A solution to the problem of crash recovery in distributed
systems is given here. A formal model of crash
recovery using backward error recovery techniques is intro-
CHAPTER 1
INTRODUCTION 

1.1 - Introduction

A development which has significantly affected the computer
science field over the last few years, and which is of interest
to this dissertation, is the introduction of computer
communication networks. The ARPANET [ROBE 70, KLEI 70],
a successful experiment in packet switching computer
communications networks, demonstrated the feasibility of
interconnecting computers in a reliable and cost-effective way for
the purpose of transferring data.

Today, when most of the basic problems in the design of
computer networks seem to be solved, it becomes possible to
consider distributed applications. Such applications,
called distributed systems, should provide services or
resources to a multitude of geographically dispersed users
in such a way that any service or resource may be requested
by the user using a site independent name. In this way, the
user does not have to be concerned with the topology of the
network nor with the location of the objects he needs.

Examples of the kinds of distributed systems we are
considering in this dissertation include distributed database
management systems and office automation systems.

This dissertation is primarily concerned with aspects
of coordination of resources in distributed systems. Solutions
to issues such as concurrency, crash recovery and database
synchronization are given here.

1.2 - Objectives

The  goals  of  this  dissertation  are: 

1. to develop synchronization protocols for coordination
of access to distributed databases. These
protocols should be robust in the face of additional
failures, preserve database consistency and
exhibit reasonable performance.

2. to develop algorithms for detection of deadlocks
in distributed databases. These algorithms should
not require complete information about the status
of all the active transactions in the system and/or
system resources in order for all possible
deadlocks to be detected.

3. to develop a formal model of crash recovery in
computer systems. This model should represent the
history of each recoverable object in the system
as well as the information flow between them.

4. to develop algorithms, based on the formal model
mentioned above, to perform crash recovery in
centralized and distributed systems. The algorithms
for distributed systems should take into account
the fact that there is no central repository of
information regarding the status of the various
objects in the system.

5. to outline the architecture of the UCLA Distributed
Secure System Base (DSSB) [MENA 78c]. The
recovery facilities of the DSSB utilize the model
and algorithms for crash recovery mentioned above.


1.3 - Organization of this Dissertation

In chapter 2 we identify the major issues in the design
of distributed systems, namely: interconnection structures,
error recovery, user interface, security and program and
data assignment. Also, in this chapter we discuss some of
the aspects of the design of distributed database management
systems such as data integrity, concurrency control, crash
recovery and synchronization protocols.

Chapter 3 addresses the issue of concurrency control
and deadlock handling in distributed databases. A locking
protocol with adaptive centralized control and distributed
recovery procedures is presented. A hierarchical extension
of this protocol is also proposed. In the same chapter, we
present two deadlock detection protocols for distributed
databases - a distributed one and a hierarchically organized
one.

Chapter 4 is concerned with issues of crash recovery in
computer systems. In particular, it addresses the following
problem. Given a set of objects (e.g. processes, files,
etc.), a set of snapshots for each of them and the records
of information flow between them, find the set of objects
and their snapshots which should be used to restore the
state of the system to a consistent state, once an error in
any of the (potentially interacting) objects is detected.
An efficient algorithm to perform consistent state restoration
in a centralized computer system is given in this
chapter.
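
The restoration problem stated above can be illustrated with a small sketch. It is not the algorithm of chapter 4; it is a plain fixed-point iteration under assumptions made only for this example: snapshots and information flow records carry globally comparable times, all times are distinct, a message whose sending is undone forces its receiver back to a snapshot taken before the receipt, and the failed object is restored to its latest snapshot before the error. The names `restoration_set`, `snapshots` and `flows` are hypothetical.

```python
from bisect import bisect_left
from math import inf


def latest_snapshot_before(times, t):
    """Return the latest snapshot time strictly earlier than t."""
    i = bisect_left(times, t)
    if i == 0:
        raise ValueError("no snapshot is old enough")
    return times[i - 1]


def restoration_set(snapshots, flows, failed, error_time):
    """Map each object that must be backed up to the snapshot to use.

    snapshots: object -> sorted list of snapshot times
    flows:     list of (sender, send_time, receiver, receive_time)
    Objects absent from the result keep their current state.
    """
    restore = {failed: latest_snapshot_before(snapshots[failed], error_time)}
    changed = True
    while changed:
        changed = False
        for src, sent, dst, received in flows:
            src_point = restore.get(src, inf)   # inf means "not backed up"
            dst_point = restore.get(dst, inf)
            # The send is undone but the receiver still reflects the receipt.
            if src_point < sent and dst_point > received:
                restore[dst] = latest_snapshot_before(snapshots[dst], received)
                changed = True
    return restore


snapshots = {"P": [0, 5, 10], "Q": [0, 4, 9], "R": [0, 8]}
flows = [("P", 7, "Q", 8), ("Q", 6, "R", 7), ("R", 1, "P", 2)]
print(restoration_set(snapshots, flows, "P", 12))
print(restoration_set(snapshots, flows, "P", 6))
```

In the second call the error is detected before P's snapshot at time 10 is taken, so backing up P undoes a message to Q, which in turn undoes a message to R, which finally forces P back even further. The loop terminates because every update moves an object to a strictly earlier snapshot.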

Protocols for performing consistent state restoration
in distributed systems are developed and presented in
chapter 5.

Finally, chapter 6 presents an outline of the architecture
of the UCLA Distributed Secure System Base.


CHAPTER  2 


CURRENT  ISSUES  IN  DISTRIBUTED  SYSTEMS 
AND  DISTRIBUTED  DATABASE  MANAGEMENT  SYSTEMS 


2.1 -


This chapter discusses the major issues involved in the
design of distributed systems and of distributed database
management systems. Some examples of applications for the
kind of distributed systems we are considering here are:
office automation systems, message systems and distributed
database management systems.


Office automation systems are systems intended to automate
all the operations in an office. Such systems can be
envisioned as a collection of intelligent terminals or work
stations connected via a local network. A typical work
station would consist of a CRT display, a keyboard, local main
memory, a processor and a floppy disk. These stations
would have enough processing power to carry out word
processing functions such as text editing and formatting and
to run related software. Office automation systems must
include some form of message processing system and
some form of distributed database management system.


Message Processing Systems provide automatic message
handling facilities such as: message composition, message
distribution, message filing, message retrieval, message
annotation and the like. Message Processing Systems are
generally a part of an office automation system.


A Distributed Database Management System provides the
means through which data elements which are stored in multiple,
independently operating sites of a computer network
can be accessed. Such systems must include the functional
capability to locate and access the requested data and to
ensure that the consistency of the database is preserved in
the face of concurrent access and in the face of crashes.

2.2 - Issues in Distributed Systems

The phrase "distributed system" has been used so often
to refer to fundamentally different classes of systems that
its meaning has become too vague to be useful. Some examples
of classes of systems which have been labeled as distributed
systems are: packet switching computer communications
networks such as the ARPANET [ROBE 70], network
operating systems such as the RSEXEC [THOM 73] and the
National Software Works (NSW) [CROC 75], tightly coupled
microprocessor systems such as the Cm* [SWAN 77] and even
collections of intelligent terminals (e.g. terminals with some
local editing and formatting capability) connected to a
mainframe in a star-like configuration.
To narrow down the characterization of distributed systems
we will use a particular definition, originally given
in [ENSL 78], which encompasses the motivations laid out in
the previous section.


"A distributed system is a collection of elements (e.g.
processors, terminals, etc.) interacting through a
communications network in a cooperative fashion.
It is also required that there be a high level
operating system that unifies and integrates the
control of the distributed elements and that provides
system transparency to users. It should be
possible for the various system resources to be
dynamically assigned to tasks and there should be
multiple [...] of these resources."


Some comments about the above definition are in order.
The communications network assumes that information flow
between two sites takes place according to a two-party
protocol. The elements that compose the system should interact
in a cooperative fashion at the logical level and at a
physical level. This interaction should preserve the autonomy
of each of the interacting elements. The high level operating
system should allow users to refer to the services they
need with no concern over the identity and location of the
server providing it.


Therefore, for our purposes, computer networks,
multiprocessors and intelligent terminals will not be
considered among the distributed systems addressed in this
thesis. Computer networks do not exhibit the system
transparency required by the definition. Multiprocessors by
themselves, with no high level operating system, cannot be
considered a distributed system. Intelligent terminals,
among other things, do not meet the requirement that
interaction between system elements be cooperative.

2.2.2 - The Value of Distributed Systems

One of the reasons for having a distributed system is
to achieve resource sharing. One would like to be able to
share specialized software or hardware resources, such as
image processing programs, word processors or number
crunching machines.

Distributed systems, if properly designed, have the
potential of exhibiting higher reliability and higher
availability than centralized systems. This is only possible
if enough redundancy is available, if efficient and correct error
recovery mechanisms are implemented, and if the architecture
assures that errors do not propagate.


Another motivation for having distributed systems is
the potential for load sharing, adaptation to changes in the
work load and good response to transient overloads. In order
for these characteristics to be present, the system must have
the ability to dynamically move executing tasks to other sites.


With the advent of large scale integration (LSI),
hardware costs have declined in such a way that today the
cost of a mainframe is an order of magnitude greater than
that of a collection of minicomputers which collectively
have the same computational power as the mainframe.
In other words, distributed systems exhibit less
cost for more power.


The decrease in hardware costs made it feasible for
small departments within an enterprise to have their own
small computers [SALT 78]. There is a need for these machines
to be connected together so that operations that span
departments (e.g. weekly financial reports of the enterprise)
can be automated. This arrangement reflects the
Let us now identify the major issues in the design of
distributed systems.


2.2.3.1 -


The different machines in a distributed system can be
connected in several distinct ways. Among the possibilities
one should distinguish between local and geographically
distributed networks. Networks which interconnect computers
located in a reasonably small area - of the order of a few
square miles - such as an office building or even a room are
called local networks. For such short distances, one can
use communication lines which exhibit high bandwidth and low
delay. With current technology, high bandwidth wires cannot
be used for distances over one mile or so. Two examples of
local networks are the Distributed Computing System (DCS)
built at the University of California, Irvine [FARB 72c] and
the ETHERNET built at the Xerox Palo Alto Research Center
[METC 76]. DCS has a ring topology - every node is connected
to two others. The ETHERNET is a broadcast type of network
in which all the nodes are connected to a transmission
medium, called the "ether", which is a pair of twisted cables.
Messages sent on the ether can be received by all the nodes
- this is called a broadcast type of transmission. Messages
can be destroyed due to collisions with other messages
transmitted almost simultaneously. Collision detection and
retransmission protocols are therefore required.


If the machines on the network are spread through a
large geographical area, we have a geographically
distributed network. Perhaps the best known example of such
a network is the ARPANET [ROBE 70]. Private or leased
telephone lines and/or satellite links are used to connect
computers in geographically distributed networks. These
connections have low bandwidth and high delay characteristics.

The kind of interconnection structure, i.e. local or
geographically distributed, considerably affects
the design of a distributed system. As an example, the
design of the DSSB, described in chapter 6 of this
dissertation, utilizes the fact that fat wires are available to
implement a network wide page faulting mechanism.

2.2.3.2 - Error Recovery

As we pointed out in section 2.2.2, one of the main
objectives of a distributed system is to have high reliability
and high availability. The potential for achieving this
goal is present in distributed systems. However, in order
for this potential to materialize, it is necessary that
appropriate error recovery facilities be built into the
system. These facilities should be mostly automatic since the
user has little or no control over the current location of
his processes and files, given that system transparency is
one of the desirable properties of a distributed system.
Also, in general we try to insulate users from problems not
directly related to getting their task done.

There are basically two strategies for error recovery,
namely backward error recovery and forward error recovery.
In the former, the system must be backed up to previously
saved snapshots. The collection of snapshots of the objects
backed up (e.g. files, processes) must restore the system to
a globally consistent state, i.e., a state which the system
might have reached through normal operation, starting at a
consistent state. The latter strategy requires that the
state of the damaged objects be corrected - with the aid of
some redundancy - and that operations be restarted. The
applicability of forward error recovery techniques is very
much dependent on the nature of the error itself and of its
consequences.

While error recovery is also an issue in centralized
systems, it becomes harder and more crucial in distributed
ones. The main reason is the time delay inherent
to computer networks combined with the fact that status
information about the several objects in the system is
distributed and not centralized. In this dissertation we
address formally the issue of crash recovery in computer
systems (see [MENA 79]) and we develop algorithms to carry out
error recovery in distributed systems.


2.2.3.3 - Network Operating Systems

As we mentioned in section 2.2.1, a high level operating
system is a necessary element of a distributed system.
This operating system will be called a network operating
system (NOS). An NOS should provide to the user a
transparent view of the system in the sense of allowing services
to be requested from the system by site independent names
and not from servers whose locations must be known by the
user. Therefore, the different resources of the system
should be controlled and made accessible to the user by the
NOS in a convenient and cost-effective way. One requirement
of an NOS is that all of its functions, such as
resource allocation, scheduling and synchronization, be
decentralized.


A  network  operating  system  should  have  the  following 
two  components: 


1. A network wide interprocess communication mechanism.
This mechanism coupled with a network wide
process name space makes the network transparent
as far as interprocess communication is concerned.

2. A distributed file system together with a network
wide file name space. Such a mechanism makes the
network transparent as far as file usage is
concerned.
Two network operating systems have been developed for a
collection of nodes in the ARPANET. One of them, the RSEXEC
[THOM 73], provides a distributed file system which makes
all files, regardless of location, uniformly accessible.
The other is the National Software Works (NSW) [CROC 75],
which is designed to provide the users of the system with a
uniform way of accessing a variety of software tools
regardless of their location in the network. Unlike RSEXEC, file
operations are logically centralized in NSW due to the
existence of a central catalog used to resolve file
references.

A third example of a distributed operating system is
the one developed for the Distributed Computing System (DCS)
[FARB 72 and ROWE 75]. An interesting feature of this
system is its resource allocation scheme, which is based on a
bidding strategy. A process which requires a service
bargains with the processes which can supply it. The
requesting process then picks the bidder which has indicated
the minimal amount of induced overhead.
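
The selection step of such a bidding scheme can be sketched in a few lines. This is only an illustration of "pick the bidder with minimal induced overhead"; the function name `pick_bidder`, the site names and the numeric overhead figures are assumptions made for the example and are not taken from the DCS design.

```python
def pick_bidder(bids):
    """Return the bidder quoting the smallest induced overhead.

    bids: mapping from a supplying process to the overhead it quotes
          (any comparable cost measure; the unit is left open here).
    """
    if not bids:
        raise ValueError("no process offered the requested service")
    return min(bids, key=bids.get)


# Three hypothetical suppliers answer a request with their bids.
bids = {"site_a": 7.5, "site_b": 3.0, "site_c": 4.2}
print(pick_bidder(bids))
```

Because each requester decides from the bids it receives, no central allocator is needed, which fits the requirement that NOS functions be decentralized.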
The issue of security in centralized computer systems
has gained considerable attention from the research community
over the past few years. Among the leading projects in
the area is the UCLA Data Secure Unix Operating System
[KAMP 77]. It is a kernel based operating system. All the
security relevant functions of the operating system have been
isolated in the kernel, which has been designed to be amenable
to validation [POPE 7M]. The kernel is currently being
verified for its security properties [KEMM 79 and POPE 78b].


Decentralization adds another dimension to the problem
of security in computer systems since communications lines
and switching processors cannot be assumed to be secure. As
pointed out by Popek and Kline in [POPE 78a], the environment
presents several threats such as the tapping of communication
lines, introduction of spurious traffic, retransmission
of previously transmitted genuine traffic or breaking of
physical lines.


The  only  general  solution  to  the  above  problems  is  the 
use  of  some  form  of  encryption.  Related  issues  are:  the 
placement  of  the  endpoints  of  the  encryption  pairs,  key 
distribution,  confinement  and  authentication. 
2.2.3.5 - Program and Data Assignment

In order to adapt to changes in the work load and to
adapt to failures, a distributed system must have the
capability to move programs and files in a dynamic way. This
fact raises the issue of finding optimal strategies to
allocate data and files to processors in order to minimize cost.

2.3 - Issues in Distributed Database Management

This  section  highlights  the  major  issues  in  distributed 
database  management  systems. 

2.3.1 - Data Integrity

A database is an information model of a real world
environment and as such it should reflect the properties of
this environment. In general, the database integrity is
said to be violated if there is an inconsistency between it
and the real world it models. Two distinct types of
consistency notions have to be considered in distributed
databases (DDBs), namely: internal consistency and mutual
consistency.

The notion of internal consistency applies to distributed
databases as well as to centralized ones. A database
is said to be in an internally consistent state if it
satisfies a set of predefined assertions or consistency
constraints. For instance, in an airline reservation system,
a consistency constraint may be that the number of
reserved seats does not exceed the capacity of the plane.
In a banking system we might require that the sum of all the
balances of customers' accounts and assets be equal to zero.
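
A consistency constraint of this kind can be read as a predicate over the database state. The sketch below is only an illustration of the two examples just given; the dictionary layout, the function names and the sign convention (customer balances recorded as negative liabilities so that the ledger sums to zero) are assumptions made for the example.

```python
def seats_within_capacity(db):
    """Airline constraint: no flight has more reservations than seats."""
    return all(f["reserved"] <= f["capacity"] for f in db["flights"].values())


def ledger_sums_to_zero(db):
    """Banking constraint: account balances and assets sum to zero."""
    return sum(db["ledger"].values()) == 0


def internally_consistent(db, constraints):
    """A state is internally consistent if every constraint holds."""
    return all(check(db) for check in constraints)


CONSTRAINTS = [seats_within_capacity, ledger_sums_to_zero]

db = {
    "flights": {"UC101": {"capacity": 120, "reserved": 118}},
    # Amounts in cents; liabilities to customers carry a negative sign.
    "ledger": {"acct_xxx": -4000, "acct_yyy": -1000, "assets": 5000},
}
print(internally_consistent(db, CONSTRAINTS))

db["flights"]["UC101"]["reserved"] = 121      # overbooked by one seat
print(internally_consistent(db, CONSTRAINTS))
```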

The notion of mutual consistency applies only to DDBs
with multiple copies. As defined in [THOM 76], a set of
multiple copies of a DDB is said to be mutually consistent
if after all activity ceases they all converge to the same
value.


Eswaran et al. introduced, in [ESWA 76], the notion of
transaction as the unit of consistency. A transaction,
which is a set of actions of the form: read, write, lock or
unlock, takes the DB from a consistent state into another
consistent state. Even if transactions are programmed in
such a way that they individually preserve the consistency
constraints of the DB, consistency can be compromised by the
two following factors: concurrent access to the database and
the occurrence of crashes. These issues will be discussed
in the following sections.


2.3.2 - Concurrency Control

If each transaction by itself preserves the consistency
constraints and if transactions are executed one after the
other, i.e., in a serial way, the database internal
consistency is preserved. However, one may want to interleave
the execution of the actions of several transactions in
order to increase system concurrency. This interleaving cannot
be arbitrary; otherwise transactions could be presented with
an inconsistent view of the database. To understand the
issue consider the following banking transaction.

TRANSFER $5 FROM SAVINGS ACCT # XXX
INTO CHECKING ACCT # YYY.

After the five dollars have been withdrawn from the
savings account but before they are credited into the
checking account, the database is temporarily in an internally
inconsistent state. This temporary inconsistency can be
tolerated as long as it cannot be "seen" by any other
transaction, i.e. another transaction cannot read the data which
is inconsistent. We therefore need a concurrency control
mechanism to control the way in which the actions of several
transactions are interleaved in order to preserve DB
consistency.

A particular sequence of the actions of several
transactions is called a schedule. A serial schedule is a
sequence of actions such that between two actions of the same
transaction there is no action of any other transaction.
It is clear that serial schedules preserve database
consistency. A schedule is said to be serializable if it
produces the same result as some serial schedule.
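
These definitions can be made concrete with a brute-force sketch. It is an illustration only and rests on assumptions introduced here: an action is a tuple (transaction, kind, item, function), reads copy an item into a private workspace, writes store a value computed from that workspace, "result" means the final database state, and equivalence is checked for one given initial state rather than for all states. The transfer T1 is the banking example above; the audit transaction T2 and the item TOTAL are hypothetical.

```python
from itertools import permutations


def run(schedule, initial):
    """Execute the actions in order and return the final database state."""
    db, workspace = dict(initial), {}
    for txn, kind, item, fn in schedule:
        ws = workspace.setdefault(txn, {})
        if kind == "read":
            ws[item] = db[item]
        else:                                   # "write"
            db[item] = fn(ws)
    return db


def is_serial(schedule):
    """True if no transaction's actions are interrupted by another's."""
    finished, current = set(), None
    for txn, *_ in schedule:
        if txn != current:
            if txn in finished:
                return False
            finished.add(txn)
            current = txn
    return True


def is_serializable(schedule, initial):
    """True if some serial order yields the same final state."""
    txns = list(dict.fromkeys(a[0] for a in schedule))
    outcome = run(schedule, initial)
    for order in permutations(txns):
        serial = [a for t in order for a in schedule if a[0] == t]
        if run(serial, initial) == outcome:
            return True
    return False


# T1: transfer $5 from savings (S) to checking (C).
t1 = [("T1", "read", "S", None), ("T1", "write", "S", lambda w: w["S"] - 5),
      ("T1", "read", "C", None), ("T1", "write", "C", lambda w: w["C"] + 5)]
# T2: audit that records the sum of both accounts.
t2 = [("T2", "read", "S", None), ("T2", "read", "C", None),
      ("T2", "write", "TOTAL", lambda w: w["S"] + w["C"])]

initial = {"S": 100, "C": 20, "TOTAL": 0}
bad = t1[:2] + t2 + t1[2:]                      # audit sees the missing $5
good = t1[:2] + t2[:1] + t1[2:] + t2[1:]        # interleaved but harmless

print(is_serial(t1 + t2), is_serializable(bad, initial),
      is_serializable(good, initial))
```

The schedule `bad` lets the audit read the accounts between the withdrawal and the deposit, so its recorded total matches no serial execution. Trying every serial order is exponential in the number of transactions, so this serves only as a definition check, not as a practical mechanism.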


As shown in [ESWA 76], transactions should satisfy two
conditions in order for any schedule to be serializable,
namely they should be well formed and two-phase. A transaction
is said to be well formed if all the data items it accesses
are locked in the appropriate mode by the transaction. For
instance, before transaction T is able to write into data item
i, this data item must be locked in exclusive mode by T. A
transaction is said to be two-phase if it is not allowed to
acquire any further locks after it has released the first
lock. Therefore, two-phase transactions have a growing phase,
during which they are allowed to acquire new resources, and a
shrinking phase, which starts when the first release resource
request is issued and ends with the transaction. Figure 2.1
shows an example of a well formed and two-phase transaction.

LOCK A
READ A
LOCK B
READ B
UPDATE B
UNLOCK B
UPDATE A
UNLOCK A

FIGURE 2.1 - Example of a well formed
and two-phase transaction.
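
The two conditions can be checked mechanically. The following
is a minimal sketch, not taken from the report: the function
names and the (operation, item) encoding are illustrative
assumptions, and a single lock mode is used, so the distinction
between shared and exclusive locks is ignored.

```python
def is_well_formed(actions):
    """True if every READ/UPDATE touches an item that is locked at that point.

    Simplification (assumption): one lock mode only.
    """
    held = set()
    for op, item in actions:
        if op == "LOCK":
            held.add(item)
        elif op == "UNLOCK":
            held.discard(item)
        elif item not in held:  # READ or UPDATE on an unlocked item
            return False
    return True


def is_two_phase(actions):
    """True if no LOCK is issued after the first UNLOCK."""
    released = False
    for op, _item in actions:
        if op == "UNLOCK":
            released = True  # the shrinking phase has begun
        elif op == "LOCK" and released:
            return False
    return True


# The transaction of Figure 2.1, encoded as (operation, item) pairs.
FIGURE_2_1 = [
    ("LOCK", "A"), ("READ", "A"), ("LOCK", "B"), ("READ", "B"),
    ("UPDATE", "B"), ("UNLOCK", "B"), ("UPDATE", "A"), ("UNLOCK", "A"),
]
```

Under this encoding the transaction of Figure 2.1 passes both
checks, while a transaction that locks B after unlocking A
fails the two-phase check.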

An undesirable side-effect of the need to lock resources is
deadlock. A deadlock is a situation in which a transaction
cannot make any further progress because it needs a resource
held by another transaction which is itself waiting for a
resource held by the first transaction. This mutual blocking
may involve more than two transactions in a cyclic pattern.
For example, consider three transactions T1, T2 and T3 and
three resources a, b and c. Assume that T1 has resource a
locked in exclusive mode and is waiting to acquire resource b,
which is held by T2 in exclusive mode. Assume that T2 is
waiting for resource c, which is held by T3, which is in turn
waiting for resource a. Therefore, none of the transactions is
able to continue, and none of them is able to release the
resource which is needed by the others.
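
The cyclic wait in this example can be made explicit with a
"waits-for" relation between transactions. The sketch below is
an illustration only; the dictionary encoding and the function
name are assumptions, and it is not one of the detection
protocols of chapter 3.

```python
def find_cycle(waits_for):
    """Return a list of transactions forming a cycle, or None if there is none.

    waits_for maps each transaction to the transactions it is waiting on.
    """
    finished_or_active = set()

    def visit(node, path):
        if node in path:                 # back edge: a cycle closes here
            return path[path.index(node):]
        if node in finished_or_active:   # already explored, no cycle through it
            return None
        finished_or_active.add(node)
        for nxt in waits_for.get(node, ()):
            cycle = visit(nxt, path + [node])
            if cycle:
                return cycle
        return None

    for start in waits_for:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None


# T1 waits for T2 (resource b), T2 for T3 (resource c), T3 for T1 (resource a).
example = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}
deadlocked = find_cycle(example)
```

For the three-transaction example the function reports a cycle
through T1, T2 and T3; removing any one of the three edges
makes it return None.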

There are several alternatives to handling deadlocks in DDBs,
namely: deadlock avoidance, deadlock prevention, deadlock
detection and resolution, and transaction timeout, backout and
restart. The advantages and disadvantages of the several
alternatives are discussed in chapter 3 of this dissertation.
Deadlock detection protocols (see [MENA 78b]) are also
presented there.

2-2-2 - Crash Recovery

As we saw in the previous section, temporary inconsistencies
in the DB can be generated after some of the updates of a
transaction, but not all of them, have been applied. If a
crash occurs at this point, the database will be left in an
inconsistent state. In order to remedy this


written, one can start carrying it out. If a crash occurs
within this period we only have to start carrying out the
intentions list from the beginning once again. After the
intentions list is written, crash recovery is guaranteed. In a
distributed database management system, there is a process
which coordinates the updates of a given transaction. This
coordinator requests that all intentions lists be written at
all the participating sites before it requests that any one of
them start to carry them out. This protocol was also described
by Gray in [GRAY 78] and called the two-phase commit protocol.
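
The restart-from-the-beginning property can be illustrated
with a small sketch. It assumes, beyond what the text states,
that each entry of the intentions list records the new value
of an item, so that applying an entry twice is harmless; the
function name, the crash_after parameter and the account
balances are all hypothetical.

```python
def carry_out(db, intentions, crash_after=None):
    """Apply (item, new_value) entries in order.

    crash_after=n simulates a crash just before entry n is applied.
    Returns True if the whole list was carried out.
    """
    for n, (item, value) in enumerate(intentions):
        if crash_after is not None and n == crash_after:
            return False
        db[item] = value
    return True


# Hypothetical balances for the $5 transfer used earlier in the chapter.
db = {"savings": 100, "checking": 20}
intentions = [("savings", 95), ("checking", 25)]  # assumed already on stable storage

finished = carry_out(db, intentions, crash_after=1)  # crash between the two updates
state_after_crash = dict(db)                         # temporarily inconsistent
carry_out(db, intentions)                            # recovery: start again from the top
```

After the simulated crash only the savings account has been
updated; rerunning the whole list brings both accounts to
their intended values.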

2-2-1  -  Synchronization  Protocols 

Synchronization protocols are the set of rules which
coordinate the access to a DDB. These protocols should deal
with multiple copies, deadlocks, DB consistency and crash
recovery.

Several such protocols have been suggested in the literature
[ALSB 76, BADA 78, ELLI 77a, ELLI 77b, LELA 78, ROSE 77,
MENA 78a, STON 78, and ROTH 77]. In chapter 3 of this
dissertation we list the desirable properties of
synchronization protocols and then we describe a protocol
which satisfies the desired properties.

A characterization of distributed systems was given in this
chapter. The major issues in the design of such systems are:
interconnection structures, error recovery, network operating
systems, security, and program and data assignment.

The major issues in distributed database management systems,
as identified in this chapter, are data integrity, concurrency
control, crash recovery and synchronization protocols.


CHAPTER  3 


CONCURRENCY  CONTROL  AND  DEADLOCK  HANDLING 
IN  DISTRIBUTED  DATABASE  MANAGEMENT  SYSTEMS 

3.1 - Introduction

This chapter is concerned with issues of concurrency control
and deadlocks in distributed database management systems. A
locking protocol for coordination of resources and the
maintenance of database consistency throughout normal and
abnormal conditions is presented here. The protocol has
centralized control and distributed recovery procedures. In
this protocol, deadlocks are handled by the centralized
controller.

Two other locking and deadlock detection protocols for
distributed databases are presented in this chapter: one is
hierarchically organized and the other is distributed. A graph
model which depicts the state of execution of all transactions
in the system is used by both protocols. A cycle in this graph
is a necessary and sufficient condition for a deadlock to
exist. Nevertheless, neither protocol requires that the global
graph be built and maintained in order for deadlocks to be
detected. In the case of the hierarchical protocol, the
communications cost can be optimized if the topology of the
hierarchy is appropriately chosen.

Coordination of resources in a distributed environment
exhibits additional complexity over resource coordination in
centralized environments due to:

1. possibility of crashes of participating sites and/or
   communication links. Occurrence of such failures can
   render the database inconsistent if not appropriately
   handled by the coordination algorithm.

2. network partitioning: in general, it is not possible to
   distinguish between messages which could not be delivered
   due to a crash of the recipient site and undelivered
   messages due to network partitioning. Therefore, network
   partitioning in the more general sense considered here is
   not simply a matter of proper network topology design. It
   turns out that detection of network partitioning can only
   occur at network reconnection time.

3. inherent communication delay: the time to get a message
   through a computer communication network may be
   arbitrarily long, although finite. Therefore any proposed
   solution should operate correctly regardless of the delay
   experienced by any message, and in general should be
   efficient.

A protocol to coordinate concurrent access to a distributed
database using locking is presented in this section. The
algorithm has as its core a centralized locking protocol with
distributed recovery procedures. A centralized controller with
local appendages at each site coordinates all resource
control, with requests initiated by application programs at
any site. Recovery is broken down into three disjoint
mechanisms: single node recovery, merge of partitions, and
reconstruction of the centralized controller and tables.

Among the properties of the proposed protocol we have:

A- robustness in the face of crashes of any participating
   site, as well as communication failures, is provided. The
   protocol can recover from any number of failures which
   occur either during normal operation or during any of the
   three recovery processes. Recovery is done in such a way
   that maximum forward progress is achieved.

   be easily integrated given the centralized control
   characteristic of the proposed algorithm.

   straightforward integration of predicate locking methods
   [ESWA 76] is permitted. Value dependent lock specification
   at the logical level is necessary to avoid the problems of
   "phantom tuples" discussed by Eswaran et al [ESWA 76].
   Other locking disciplines may also be easily supported.

   continued local operation in the face of network
   partitioning is supported. The locking algorithm operates,
   and operates correctly, when the network is partitioned,
   either intentionally or by failure of communication lines.
   Each partition is able to continue with work local to it,
   and operation merges gracefully when the partitions are
   reconnected.

   performance of the algorithm does not degrade operations.
   It is shown in a later section of this chapter that for
   many topologies of interest, the delay introduced by the
   protocol is not a direct function of the size of the
   network. The communication cost is shown to grow in a
   relatively slow, linear fashion with the number of sites
   participating in a given transaction.

   of the failures mentioned before can be proven in a
   straightforward way.

The protocol presented in this section is described in an
intuitive manner in the next section, followed by a more
detailed description in the two subsequent ones. An
algorithmic specification of this locking protocol can be
found in [MENA 77]. An informal proof of the correctness of
the algorithm is presented here. The proof is decomposed into
five major parts: one for normal operation, three for the
recovery phases, and a last part that shows the parts actually
can be proved disjointly.

A proposal for an extension aimed at optimizing operation of
the algorithm to adapt to highly skewed distributions of
activity is presented here. The extension applies nicely to
interconnected computer networks.


However, messages may have to be retransmitted many times
until they get through the net. An implication of these
assumptions is that messages may be delayed by an arbitrary
but finite amount of time. We also assume that messages from a
source site A are delivered by the network protocols to a
destination site B in the same order they were generated.
However, we make no assumptions about the order in which
messages from two distinct sources are received by a third
one. We require that the network routing procedures be such
that every pair of nodes can communicate with each other if
the necessary physical connection is available.

User interaction with the database is done through
application programs, APs, which communicate with the Data
Base Management processes. Of those processes, two are of
interest for this locking protocol: the centralized lock
controller, or simply lock controller, and the local lock
controller.

As a first approximation assume that there is only one lock
controller or LC for the entire network. This process is
responsible, among other things, for examining lock and lock
release requests from the APs, and deciding whether they
should be granted or not. For this purpose, the LC maintains a
table called the LOCK table, which is a set of all the active
locks. Each entry in this table is a 3-tuple of the form
(H,T,P) where H is a unique host identification, T is a unique
database transaction identifier within each site and P is a
description of the logical portion of the database to be
locked as well as the lock mode (e.g., read, write, etc.). In
a relational database, the lock specification may for example
be a predicate lock as described by Eswaran et al [ESWA 76].
Note that the pair (H,T) is a globally unique transaction
identifier.
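
A LOCK table entry can be pictured as a small record. The
sketch below is illustrative only: the field names are
assumptions, P is reduced to an item name plus a mode rather
than a predicate, and the conflict rule (two locks on the same
item conflict unless both are reads) is the usual
shared/exclusive convention, which the text does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockEntry:
    host: str     # H: unique host identification
    txn: int      # T: transaction identifier, unique within the host
    portion: str  # first half of P: what is locked (an item name here)
    mode: str     # second half of P: "read" or "write"

    @property
    def txn_id(self):
        """The pair (H,T), which is globally unique."""
        return (self.host, self.txn)


def conflicts(a, b):
    """Assumed rule: different transactions, same portion, at least one writer."""
    return (a.txn_id != b.txn_id
            and a.portion == b.portion
            and "write" in (a.mode, b.mode))


lock_table = {
    LockEntry("hostA", 1, "savings#XXX", "write"),
    LockEntry("hostB", 1, "checking#YYY", "read"),
}
request = LockEntry("hostB", 2, "savings#XXX", "read")
blocked = any(conflicts(request, held) for held in lock_table)
```

Note that the two entries with T = 1 belong to different
transactions, because their hosts differ.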


At every site, except for the one where the LC is located,
there is a local lock controller or LLC. Those processes are
responsible for maintaining a local copy of the relevant
portion of the LOCK table. The relevant portion of the LOCK
table is the set of entries which contain locks which refer to
data stored locally. Any LLC may become the lock controller
whenever there is a crash in the system which makes the LC
unavailable. The recovery process is explained later in
detail. Each time a transaction takes an action the local copy
of the LOCK table is examined to determine whether the action
can be performed or not. Therefore, there are two reasons for
keeping a local copy of the LOCK table, namely: resilience to
failures and local action checking.


It is convenient at this point to introduce the notion of
logical partition or logical component, as opposed to that of
a physical component. A physical component is a maximal
subnetwork such that every pair of sites in the component can
communicate with one another. It can be readily seen that the
composition of a physical component is not under the control
of the locking protocol, since nodes and communication links
fail independently of the protocol operation. Such a lack of
control could make the operation of the protocol, in the face
of crashes, rather complex. The concept of logical component
is introduced to give the protocol independence from
unexpected changes in the composition of each physical
component. To this end, each LC keeps a list of sites which it
believes are still up, called the up list. A logical component
is defined as being the subnetwork indicated by the nodes
which are in the up list. This list may lag behind the list of
sites which are actually up. Independence from the composition
of physical components is thus achieved by controlling the way
by which the latter list maps into the former, in a way which
is explained later in this chapter.

Since one of our stated goals is to allow local operations to
continue in the face of network partitioning and to allow
partitions to merge gracefully, it is necessary for each
partition to have its own LC. There is one LC for each logical
component.

The operation of the locking protocol under no crash
conditions can be intuitively explained as follows. The LC
receives lock and lock release requests from the application
programs. The LC then assigns a sequence number to the
request. These sequence numbers are taken from a monotonically
increasing sequence of numbers and will be used for the
purposes of crash recovery, as will become clear in a later
section. Each request (including its sequence number) is then
sent to all relevant LLCs in the component. An LLC is relevant
with respect to a request if its site stores data addressed by
the request in question. The request is stored in a pending
list at each LLC site and an acknowledgment is sent back to
the LC. After the acknowledgment from all relevant sites in
the component is received (excluding those which crashed in
the meantime) a confirmation for the request is sent by the LC
to all LLCs, causing the request to be deleted from the
pending list and appended to the LOCK table.

A lock request may be rejected by an LC if it conflicts with
other locks in the LOCK table or in the pending list, or if
the request is not local to the component. We assume that the
LC is able to determine for each lock, P, the set, LOC(P), of
sites where the data to be locked are stored. Thus, a lock P
is said to be local if LOC(P) is contained in the up list for
the component. The set LOC(P) can be determined by the LC by
checking some catalogs. The organization of those catalogs is
not relevant here; see [CHU 76] and [STON 76] for discussions
of that subject.
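
The locality test is a subset check. In the sketch below the
catalog contents, site names and function names are
hypothetical, and a lock is identified simply by the items it
covers.

```python
# Hypothetical catalog: which sites store each data item.
CATALOG = {
    "savings": {"site1"},
    "checking": {"site2", "site3"},
}


def loc(items):
    """LOC(P): the set of sites storing the data addressed by lock P."""
    sites = set()
    for item in items:
        sites |= CATALOG[item]
    return sites


def is_local(items, up_list):
    """A lock is local if LOC(P) is contained in the component's up list."""
    return loc(items) <= set(up_list)


up_list = ["site1", "site2"]              # site3 is not in this component
can_lock_savings = is_local(["savings"], up_list)
can_lock_checking = is_local(["checking"], up_list)
```

With site3 missing from the up list, a lock on the savings data
is local while a lock on the checking data is not, so the LC
would reject the latter.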


Every time that a site or a set of sites drops out of the up
list, all the transactions which had at least one lock which
became not local will be aborted or backed up. All the locks
held by these transactions are released. In this way complete
locality of operations is enforced by the CLC protocol.


If the LC crashes or becomes unavailable, a recovery mechanism
called Logical Component Recovery (LCR) takes place. As soon
as an LC crash is detected by any process engaged in a
conversation or exchange of messages with the lock controller,
a new process is nominated to be the new LC. There is a
globally known circular ordering of the sites from which the
nominee is selected. If the nominee is up it accepts the
nomination by sending a message which circulates through all
the sites in the component. The purpose of this message is
also to collect all the requests which have been received by
all the relevant sites but which are still in the pending list
for at least one of these sites. Those requests will be
incorporated into the LOCK table at every site in a subsequent
phase of the recovery process. In summary, the LCR mechanism
amounts to electing a new LC for the component and updating
all the LOCK tables appropriately before normal operation is
resumed. Various race conditions are dealt with by the details
of the recovery protocol.
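
One possible reading of the nomination step is sketched below.
The text only states that the nominee is taken from a globally
known circular ordering; moving on to the next site when a
nominee is down is an assumption made here, as are the
function name and the is_up predicate.

```python
def next_nominee(ordering, crashed_lc, is_up):
    """Walk the circular ordering from the crashed LC to the first site that is up."""
    start = ordering.index(crashed_lc)
    n = len(ordering)
    for step in range(1, n):
        candidate = ordering[(start + step) % n]
        if is_up(candidate):
            return candidate
    return None  # no other site in the ordering is up


ordering = ["site1", "site2", "site3", "site4"]
down = {"site3", "site4"}                   # site3 was the LC; site4 is also down
new_lc = next_nominee(ordering, "site3", lambda s: s not in down)
```

Here the walk skips site4 and wraps around to site1.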
It is the responsibility of each LC to periodically monitor
the connection between it and a node not in its up list. If a
physical connection between two previously logically
disconnected components is detected, a Logical Component Merge
(LCM) mechanism is started. LCM is always done pairwise
between components, and in this process the LC of one of the
components plays an active role while the other plays a
passive one. The first phase of LCM is composed of an
interconnection protocol by which two LCs are logically
connected in such a way that one of them is designated active
and the other passive. This protocol also enforces the
pairwise merge condition and is shown to be deadlock free.
After a logical connection has been established both LCs clear
all outstanding requests and reject further ones. In the
subsequent phase, the union of the LOCK tables of the two
components is made and the new LOCK table is sent to all the
sites in both components in the form of a message which
circulates through them. This message signals the completion
of the merge. The active LC becomes the lock controller for
the new logical component.

When a site which was down recovers, it is made active by the
Single Node Recovery (SNR) mechanism, which basically amounts
to the acquisition by that site of a new copy of the relevant
portion of the LOCK table.


The three recovery mechanisms described above do not interact
with each other, as will be shown later. This property is
important because it allows us to decompose the correctness
proofs into a proof of disjointness and then proofs for each
recovery procedure separately.

The recovery mechanisms will be shown to be robust in the face
of additional failures. In order to achieve this goal, each
mechanism is designed in such a way that a partial execution
of any of the recovery algorithms does not destroy any of the
properties we want to prove about them.


It  is  important  to  emphasize  at  this  point  that,  since 
all  the  lock  requests  are  examined  by  a  centralized  lock 
controller  in  one  logical  component,  locks  granted  by  an  LC 
do  not  conflict  with  one  another.  This  fact  enables  us  to 
consider  the  operation  of  the  algorithm  for  normal  operation 
and  for  recovery  as  if  there  were  only  one  lock  per  logical 
component.  The  reader  is  encouraged  to  keep  this  in  mind  as 
he  reads  through  this  section. 


3.2.2 - Lock and Release Granting Algorithms


This section describes informally the algorithms used to grant
new locks and to release existing ones. One would like those
algorithms to have the property that a lock is either granted
or released if and only if it is known to all the relevant
sites. The basic structure of both algorithms can be
abstracted in what we call the Assured Communication Protocol
(ACP), which exhibits the desired property outlined below.

Let there be a sender S, who wishes to send a message M,
originated at an external source ES, to n destinations D1, D2,
..., Dn. Each site i keeps two message buffers: temp_buffer(i)
and final_buffer(i). ACP is such that message M will only be
in final_buffer(S) if M is either in temp_buffer(Di) or
final_buffer(Di) for all destinations Di. ACP can be described
by the following set of rules:

1. S receives a "MESSAGE REQUEST" or MR message from ES and
   broadcasts an "ACCEPT MESSAGE" or AM message, which
   contains M, to all Di's, i=1,...,n. The message M is
   placed in temp_buffer(S).

2. When an AM message is received by a destination Di, the
   message M is placed in temp_buffer(Di) and a "MESSAGE
   ACCEPTED" or MA message is sent back to S.

3. When all the MA messages have been received by S, M is
   moved to final_buffer(S) and removed from temp_buffer(S)
   and a "CONFIRM MESSAGE" or CM message is broadcast to all
   destinations.

4. The receipt of a CM message at destination Di causes M to
   be moved into final_buffer(Di) and removed from
   temp_buffer(Di).
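
The four rules can be traced with a small failure-free
simulation. This is a sketch under simplifying assumptions:
messages are delivered instantly and in order, no site
crashes, and the Site class, run_acp and invariant_holds are
illustrative names that do not come from the report.

```python
class Site:
    """A participant holding the two buffers named in the text."""

    def __init__(self, name):
        self.name = name
        self.temp_buffer = set()
        self.final_buffer = set()


def invariant_holds(sender, destinations, message):
    """M in final_buffer(S) implies M in temp or final buffer of every Di."""
    if message not in sender.final_buffer:
        return True
    return all(message in d.temp_buffer or message in d.final_buffer
               for d in destinations)


def run_acp(sender, destinations, message):
    """Run rules 1-4 once and return the log of (type, from, to) messages."""
    log = []
    # Rule 1: S buffers M and broadcasts AM.
    sender.temp_buffer.add(message)
    log += [("AM", sender.name, d.name) for d in destinations]
    # Rule 2: each Di buffers M and answers MA.
    for d in destinations:
        d.temp_buffer.add(message)
        log.append(("MA", d.name, sender.name))
    # Rule 3: all MAs are in, so S commits M and broadcasts CM.
    sender.temp_buffer.discard(message)
    sender.final_buffer.add(message)
    log += [("CM", sender.name, d.name) for d in destinations]
    assert invariant_holds(sender, destinations, message)
    # Rule 4: each Di moves M from its temp buffer to its final buffer.
    for d in destinations:
        d.temp_buffer.discard(message)
        d.final_buffer.add(message)
        assert invariant_holds(sender, destinations, message)
    return log


s = Site("S")
ds = [Site("D1"), Site("D2"), Site("D3")]
log = run_acp(s, ds, "M")
```

With three destinations the run exchanges nine messages (three
each of AM, MA and CM), not counting the initial MR from ES.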

A variant of this protocol, called a two-phase commit
protocol, is described in [GRAY 78] and [LAMP 76]. The
two-phase commit protocol has an additional acknowledgment
message at the end from each destination to the sender.
However, we show here that this additional message is
unnecessary. Therefore, the ACP protocol needs three messages
per destination instead of four; that is, the two-phase commit
protocol uses 33 percent more messages. The Assured
Communications Protocol is illustrated in Figure 3.1. In this
figure, the horizontal arrows are labeled with message names.
There is one vertical line corresponding to the sender S and
another corresponding to destination Di. Two graphical
notations were used in this picture. Namely, the diverging
arrows shown at point A of the vertical axis for the sender S
indicate that the message in question is being broadcast to
every one of the destination sites. The converging arrows
shown at point B of the same vertical axis indicate that the
sender must wait until it receives all the messages from all
the sites which are up. Previously up sites may be pronounced
down by the underlying network protocols after timeout and
retransmission have taken place several times.

Several other details are also worth keeping in mind. As
mentioned before, each LC keeps a list of the sites in

the requester wait. That decision is not the concern of this
section. If there are no conflicts and the lock is local to
the component, the LC must notify every relevant LLC in its
component that a new entry should be appended to their LOCK
tables. Actually, instead of inserting the lock directly into
the LOCK table, an LLC appends it to a list of pending lock
requests, called an L-list. The reason for this is to prevent
copies of the LOCK table from becoming inconsistent if the LC
crashes.

The basic structure of the Lock Granting and Lock Releasing
algorithms is the same as that of the ACP protocol, where AP,
LC, LLCi and LOCK table correspond to ES, S, Di and
final_buffer in ACP, respectively. Also, the message M in ACP
should be considered as a lock request for the Lock Granting
algorithm and as a release request for the Lock Releasing one.
For the Lock Granting Algorithm, in particular, temp_buffer
corresponds to an L-list.

Lock Releasing Algorithm

A similar procedure is followed when an AP issues a lock
release request, by sending to the LC a "RELEASE REQUEST" or
RR message. Each site keeps a list of pending release
requests, or an R-list, for the same reasons we introduced the
L-list. The R-list corresponds to temp_buffer in the ACP
protocol.

The operation of the Lock Granting and Lock Releasing
algorithms can be described by a pair of interacting graphs or
state transition diagrams as shown in figure 3.2. This
technique was introduced by Zafiropulo in [ZAFI 77a] as a tool
for modelling computer communications network protocols. There
is one state transition diagram (STD) for the LC (figure
3.2.a) and one for a local lock controller LLCi (figure
3.2.b). State transitions are triggered by events such as
message transmission or reception. Each STD has a quiescent or
initial state. As already mentioned in section 3.2.1, the LC
does not grant conflicting locks, so that we can consider the
operation of the protocol as if there were only one lock
request. Let x be such a lock. It is useful to associate with
each state s in the STD a 3-tuple (a,b,c) where a, b and c are
binary variables which indicate whether the lock x is in the
LOCK table, in the L-list or the R-list respectively when the
protocol is in state s. For instance, the tuple (1,0,1)
indicates that the lock x is in the LOCK table and in the
R-list. Labels on the state transition arcs represent
conditions or events upon which the transition takes place.
These conditions indicate either message transmission or
reception. The following abbreviations for message names were
used in the diagram of figure 3.2:


From AP to LC:

LR: LOCK REQUEST
RR: RELEASE REQUEST

From LC to AP:

LG: LOCK GRANTED
RG: RELEASE GRANTED

Among LC and LLCs:

AL: ACCEPT LOCK
LA: LOCK ACCEPTED
CL: CONFIRM LOCK
AR: ACCEPT RELEASE
RA: RELEASE ACCEPTED
CR: CONFIRM RELEASE

There are state transitions in one STD which have a companion
in the other STD, i.e. transmission of a message in one of
them and reception of the same message in the other. We label
the transitions with "signed" message names. A positive sign
represents a message reception and a negative one indicates a
message transmission.

Let a transition cycle be a path beginning and ending in the
initial state. Let a conversation [A,B] be a pair of
transition cycles A and B which belong to different STDs. Let
us define an event sequence of a transition cycle A as the
sequence of labels of the transitions in A which correspond to
internal* events only. A conversation [A,B] is said to be
synchronized if the event sequence of A is equal to the event
sequence of B except for the signs in the events, which are
reversed.
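
The definition translates directly into a check on two label
sequences. In the sketch below the two cycles are a plausible
reading of the prose description of the grant and release
algorithms, not a transcription of figure 3.2; in particular
the position of the external events LR, LG, RR and RG within
the LC cycle is an assumption, and the function names are
illustrative.

```python
# Messages exchanged with an application program are external events.
EXTERNAL = {"LR", "LG", "RR", "RG"}


def event_sequence(cycle):
    """Labels of a transition cycle that correspond to internal events only."""
    return [label for label in cycle if label[1:] not in EXTERNAL]


def reverse_sign(label):
    """Swap reception (+) and transmission (-) on a signed message name."""
    return ("-" if label[0] == "+" else "+") + label[1:]


def is_synchronized(cycle_a, cycle_b):
    """True if the event sequences agree once the signs of one are reversed."""
    return event_sequence(cycle_a) == [reverse_sign(l) for l in event_sequence(cycle_b)]


# Assumed cycles for one lock being granted and later released.
lc_cycle = ["+LR", "-AL", "+LA", "-CL", "-LG",
            "+RR", "-AR", "+RA", "-CR", "-RG"]
llc_cycle = ["+AL", "-LA", "+CL", "+AR", "-RA", "+CR"]

synchronized = is_synchronized(lc_cycle, llc_cycle)
```

Dropping the external events from the LC cycle leaves six
labels that mirror the LLC cycle with every sign reversed.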

From the description of the protocol one can easily draw the
STDs in figure 3.2. Actually an STD may be considered as a
protocol specification. Let us define a global state S for a
set of k STDs, STD1 through STDk, as a k-tuple (s1, s2, ...,
sk) where si, for i in {1,...,k}, is the 3-tuple associated
with state si in STDi. A global state is said to be feasible
if the individual component states for each STD can coexist.

If all the conversations of a protocol are synchronized then
the task of finding the set of all global feasible states is
fairly straightforward. Consider the synchronized conversation
[A,B] shown in figure 3.3 below.

Let -m be a condition on the transition cycle A which triggers
the transition from state s[a,i] into state s[a,i+1]. Also,
let +m be the companion condition in the transition cycle B
which triggers the transition from state

* An event is internal if it does not represent an external
action such as an exchange of messages with an application
program.


Figure 3.3 - Synchronized Conversation [A,B].
The dashed lines indicate global feasible states

already received.


2. { (s[b,j+1], x) | x in R2 }: these states are not
   feasible because message m was not yet sent but
   already received.

3. state (s[b,j+1], s[a,i+1]) is feasible.


This example illustrates the rule for generating the set of
all the global feasible states in a protocol where all the
conversations are synchronized. Namely, for every non-external
transition add to the set of global feasible states the states
(s[a,i], s[b,j]), (s[a,i+1], s[b,j]) and (s[a,i+1], s[b,j+1]).
If any of the STDs contains transitions which are triggered by
external events (e.g. an external request from an application
program), then the following rule must be added. Let Y(s) be
the set of all the states which can be reached from state s
through a path containing transitions associated with external
events only. For every non-external transition add to the set
of global feasible states the states (s[a,i], s[b,j]),
(s[a,i+1], s[b,j]), (s[a,i+1], s[b,j+1]) and the sets

{ (s[b,j], q) | q in Y(s[a,i+1]) },

{ (s[b,j+1], q) | q in Y(s[a,i+1]) } and

{ (s[a,i+1], q) | q in Y(s[b,j+1]) }.
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Figure  3-4  illustrates  these  rules. 
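
The generation rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: states are plain strings, every global state is written as an (A-state, B-state) pair regardless of the order used in the text, Y(s) is taken to contain only states reached by at least one external transition, and the names `external_closure` and `global_feasible_states` are hypothetical.

```python
from collections import deque


def external_closure(state, external_edges):
    """Return Y(state): states reachable through external-event transitions only.

    external_edges maps a state to the states reachable by ONE external
    transition (an assumed representation of the STD).
    """
    reached = set()
    queue = deque(external_edges.get(state, ()))
    while queue:
        q = queue.popleft()
        if q not in reached:
            reached.add(q)
            queue.extend(external_edges.get(q, ()))
    return reached


def global_feasible_states(sync_transitions, external_a=None, external_b=None):
    """Apply the generation rules to every synchronized transition.

    sync_transitions: iterable of ((a_i, a_next), (b_j, b_next)) pairs.
    Returns a set of (A-state, B-state) pairs.
    """
    external_a = external_a or {}
    external_b = external_b or {}
    feasible = set()
    for (a_i, a_next), (b_j, b_next) in sync_transitions:
        # Basic rule: three states per synchronized transition.
        feasible |= {(a_i, b_j), (a_next, b_j), (a_next, b_next)}
        # Additional rule: states reached from s[a,i+1] by external events.
        for q in external_closure(a_next, external_a):
            feasible |= {(q, b_j), (q, b_next)}
        # Additional rule: states reached from s[b,j+1] by external events.
        for q in external_closure(b_next, external_b):
            feasible.add((a_next, q))
    return feasible
```

With no external transitions the function returns exactly the three states of the basic rule; each external edge out of s[a,i+1] or s[b,j+1] adds the corresponding extra pairs.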


Using the rules derived above, one obtains the set of
feasible states for the grant and release lock protocols in
a system with an LC and one LLC. These states are shown in
Table 1.

FEASIBLE GLOBAL STATES     (LC STATE; LLCi STATE)

(s2,s11)                   [illegible]
(s3,s12)                   (0,1,0; 0,1,0)
(s2,s16)                   (0,1,0; 1,0,1)
(s4,s13)                   (1,0,0; 0,1,0)
(s5,s14)                   (1,0,0; 1,0,0)
(s7,s13)                   (1,0,1; 0,1,0)
(s7,s14)                   (1,0,1; 1,0,0)
(s8,s15)                   (1,0,1; 1,0,1)
(s9,s16)                   (0,0,0; 1,0,1)
(s10,s11)                  (0,0,0; 0,0,0)

TABLE 1 - Feasible Global States

In general, a global state for a system with an LC and k
LLCs is of the form S = (X; x1, x2, ..., xk) where X is an
LC-state and xi, for i=1,...,k, is an LLC state compatible
with the LC state X, according to Table 1. For instance, if
X = s7 = (1,0,1) the three following are examples of global
states when we have three LLCs: (s7; s13,s14,s13),
(s7; s13,s13,s14) and (s7; s13,s13,s13). The next section
gives some definitions and proofs regarding the global
feasible states of the LOCK tables, L-lists and R-lists.
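
As an illustration of how Table 1 extends to k LLCs, the following Python sketch enumerates the global states for a given LC state. The dictionary `COMPATIBLE` simply transcribes the (LC state, LLC state) pairs listed in Table 1; the function name `global_states` and the tuple representation are assumptions made for the example.

```python
from itertools import product

# (LC state) -> LLC states compatible with it, transcribed from Table 1.
COMPATIBLE = {
    "s2": ("s11", "s16"),
    "s3": ("s12",),
    "s4": ("s13",),
    "s5": ("s14",),
    "s7": ("s13", "s14"),
    "s8": ("s15",),
    "s9": ("s16",),
    "s10": ("s11",),
}


def global_states(lc_state, k):
    """Return every global state (X; x1, ..., xk) for LC state X and k LLCs.

    Each xi is chosen independently among the LLC states compatible with X.
    """
    return [(lc_state, combo) for combo in product(COMPATIBLE[lc_state], repeat=k)]
```

For X = s7 and three LLCs this yields eight global states, among them the three examples given in the text.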


We will show here that, if no crash occurs, the Lock
Granting and Lock Releasing algorithms have the property
that a lock is only granted or released if all the sites in
the component know about the request. In order to make this
statement more precise consider the following definitions.
Let LT(i), L(i) and R(i) be the LOCK table, L-list and
R-list at site i respectively.

DEFINITION 3.1 (Lock Request Presence): A lock request x or
a lock is said to be present at site i if
x ∈ [LT(i) ∪ L(i)] - R(i).

DEFINITION 3.2 (Release Request Presence): Let x be a lock
request and let y be its associated lock release request.
It is said that y is present at site i if y ∈ R(i) or if
x ∉ LT(i) ∪ L(i).

It  is  convenient  at  this  point  to  define  precisely  the 
meaning  of  a  site  being  relevant  to  a  lock.  We  say  that 
site i is relevant to lock x if at least one of the data
items addressed or covered by x is stored at site i. Let
us  define  S(x)  as  the  set  of  sites  relevant  to  the  lock  x. 
We  can  now  make  the  following  statements. 

ASSERTION 3.1: If a lock x ∈ LT(LC) and if there is no
pending release request associated with x, then the lock
request x is present at every site in S(x).

ASSERTION 3.2: If a lock x ∉ LT(LC) and if there is no
pending lock request associated with x, then the release lock
request y associated with x is present at every site in
S(x).

The  proof  for  these  two  assertions,  as  well  as  for  all 
other  assertions  in  this  chapter,  can  be  found  in  Appendix  A 
of  this  dissertation.  Together,  they  lead  directly  to  the 
following  result. 

THEOREM  3.1:  Let  C  be  a  logical  component,  LC  its  lock  con¬ 
troller  and  U  the  set  of  sites  in  C.  If  no  crashes  ever  oc¬ 
cur  then  a  lock  request  is  only  granted  by  the  LC  after  it 
is  present  at  all  the  relevant  sites  in  U  and  a  lock  is  only 
released  if  the  associated  release  request  is  present  at 
every  relevant  site  in  U. 

In a later section the results in this section will be
extended to deal with situations in which crashes can occur.

3.2.3 - Crash Recovery

So  far  we  have  described  the  protocol  for  requesting 
locks  and  releasing  them,  assuming  that  no  crash  occurred. 
Communication  links,  processors,  operating  systems  and 
processes  are  some  examples  of  sources  of  crashes. 

The three already mentioned recovery mechanisms will be
presented here. These mechanisms will be proven to be
robust with respect to additional failures. To be robust,
the protocols must preserve logical component internal and
mutual consistency as defined below, if any changes have
been made to any permanent information (like LOCK tables, up
lists or LC id's) at any node.


DEFINITION 3.3 (LT-Consistency): The set of LOCK tables of
a Logical Component is said to be LT-consistent if
assertions 3.1 and 3.2 hold at any time.

DEFINITION 3.4 (Internal Consistency): A logical component
is said to be internally consistent if the set of its LOCK
tables is LT-consistent and if there is one and only one LC,
whose identity is known to every node in the component.

DEFINITION 3.5 (Mutual Consistency): A set of logical
components is said to be mutually consistent if all of them
are internally consistent and if there is no lock present at
any LOCK table of one of them which conflicts with another
such lock of any other component.


Definition  3.5  covers  the  previous  two,  and  specifies 
an  important  property  which  is  required  of  recovery. 


The  recovery  protocols  have  been  designed  so  that  all 
crashes  which  can  occur  during  a  recovery  phase  fall  into 
one  of  the  two  disjoint  classes,  which  we  call  terminal  and 
transparent  failures. 




A terminal crash causes the entire recovery mechanism
to be aborted and restarted. The possible conditions under
which terminal crashes occur are shown to leave the protocol
in a robust state, as defined above. A transparent crash is
defined to be one which does not affect the continued
correct operation of the recovery process.


The following is the definition of robustness used in
this chapter. A recovery protocol is robust if all crashes
can be shown to be either terminal or transparent. As we
will see, for each of the recovery mechanisms, we can
identify a point before which the recovery can be considered
as not having happened at all and after which it is
considered to be successfully carried out. This point is
called the completion point. Crashes before the completion
point, if they have any effect at all, are shown to be
terminal. Crashes after the completion point are shown to be
transparent.


The  three  proposed  recovery  mechanisms  will  be  shown  to 
occur  disjointly  in  time.  In  other  words,  a  merge  of  two 
logical  components  only  takes  place  if  both  are  in  their 
normal  state  or  are  not  recovering  from  a  Logical  Component 
Crash.  Also,  a  site  only  becomes  attached  to  a  logical  com¬ 
ponent  if  this  component  is  in  its  normal  state.  These  im¬ 
portant  properties  will  allow  us  to  state  and  prove  separate 
theorems  concerning  each  one  of  them. 


We will now show how an LLC may become an LC if the LC
crashes. A crash of the LC can be detected by any process
engaged in a conversation or exchange of messages with it.
As an example, an AP may time out while waiting for a reply
from the LC for a lock or lock release request. In every
case, the process which detects a crashed LC is responsible
for nominating a new LC. For this purpose, we will assume
that the distinct sites or nodes in the underlying network
are arranged in a linear order such that node #i precedes
node #(i+1) mod n. Let this order be called the nomination
order. So, whenever a process detects a failed LC it
nominates the next node which is up in the nomination order
after the crashed LC to the position of LC. This nomination
is accomplished by the issue of an "ACCEPT NOMINATION" or AN
message by the nominator. If this message is not
acknowledged after it has been retransmitted a certain
number of times, the nominator assumes that the nominee is
down and sends an AN message to the next site in the
nomination order. However, it may be the case that the
originally nominated node was not down, as assumed by the
nominator, but that due to certain conditions in the network
its reply was seriously delayed. So, it seems that more than
one LC could be nominated in this process! Let us neglect
this issue for the moment, while we describe the recovery
procedure, and show later how such an undesirable situation
can be easily avoided.

The nominee is first responsible for checking that the old
LC is actually dead (since the nomination may have come from
an errant AP). Then the nominee must notify every other site
that it has accepted the nomination. Moreover, the nominee
must make sure that all the copies of the LOCK table are
appropriately updated. From now on, we will refer to the
crashed LC as the 'old LC' and to the nominee as the
'new LC'.

The  process  by  which  the  new  LC  becomes  the  actual  LC 
can  be  divided  into  two  phases:  a  'notification  phase'  and  a 
'LOCK  table  update  phase'. 

In the notification phase all the nodes in the
component, as indicated by the up list U, are informed of the
identity of the new LC. Also, in this phase enough
information is gathered in order to appropriately update the
LOCK tables in the subsequent phase. The update of the LOCK
tables is done in such a way that the maximum forward
progress is obtained at the end of the LCR procedure. More
precisely, in terms of the STDs introduced in section
3.2.2.3, the LCR mechanism leaves the system in the global
state which would have been reached by the system if a crash
had not occurred and if no new requests were submitted.

The new LC, upon nomination, will issue a message
called "NOMINATION ACCEPTED" or NA. This message will
circulate once through the set of all sites in U (including
the site where the new LC runs) in a predetermined order.
Notice that the up list U of the nominated LC determines the
logical component to be restored by the Logical Component
Recovery procedure.

During the NA cycle, two sets will be constructed,
namely the set L of locks to be added to all the LOCK tables
and the set R of locks to be deleted from all the LOCK
tables. The set L includes all the locks which are in at
least one lock pending list and that would therefore end up
in all LOCK tables if no crash had occurred. The set R
includes all the locks which are in at least one lock release
list and would therefore be deleted from all LOCK tables in
normal conditions.

It is possible for a given lock to be in the L-list at
one site and in the R-list at another site. This situation
can be seen from Table 1 and is also illustrated by the
following scenario. Consider an LC and two LLCs, LC1 and LC2.
Consider a lock x which was accepted by LC1 and LC2 already.
The LC then enters the lock into its LOCK table and sends
the CONFIRM LOCK (CL) message to LC1 and to LC2. LC1
receives the message and moves the lock into its LOCK table.
Now, the LC receives a release request for the same lock.
It then sends an ACCEPT RELEASE (AR) message to LC1 and LC2.
LC1 receives it and enters the request into its R-list. LC2
has received neither the CL nor the AR message so far
and therefore the lock x is still in its L-list.

When constructing the sets L and R one has to be able
to decide whether a lock should be installed in all the LOCK
tables or deleted from all of them in those cases in which
the request appears in both L-lists and R-lists. This
decision is easily made with the aid of the sequence numbers
which the LC attaches to every request. The greater the
sequence number the later is the request. Therefore, the
latest requests will be considered when constructing the sets
L and R.

Every node in the NA cycle, other than the newLC,
receives partially constructed sets L and R, adds its
contributions to them and places the new versions of the sets
into the NA message which is forwarded to the next node in
the cycle. When the NA message returns to the newLC, the
sets L and R are completed. The sets L and R are modified at
site i according to the algorithm given below.

The actual update of the LOCK tables is done using the
ACP protocol. Therefore if an additional crash occurs
during the LOCK table update phase, the set of possible states
in which the LOCK tables, L-lists and R-lists may be left
is the same as the set of possible states which can result
if the crash had occurred during normal operation.
Therefore, recovery from additional failures merely requires
restarting the LCR mechanism again.

Algorithm: Build the Sets L and R

FOR x ∈ L(i) DO
   IF there is y ∈ R associated with x
   THEN IF sequence#(x) > sequence#(y)
        THEN BEGIN
           COMMENT lock request is the latest;
           L := L ∪ {x}; R := R - {y};
        END;
        ELSE;
   ELSE L := L ∪ {x};

FOR y ∈ R(i) DO
   IF there is x ∈ L associated with y
   THEN IF sequence#(y) > sequence#(x)
        THEN BEGIN
           COMMENT the release request is the latest;
           R := R ∪ {y}; L := L - {x};
        END;
        ELSE;
   ELSE R := R ∪ {y};
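
The construction of the sets L and R can also be written as a short Python sketch. It is an illustration only, under assumptions that are not part of the protocol description: each set is a dictionary mapping a lock identifier to the sequence number the LC attached to the request, a release request is keyed by the identifier of the lock it releases, and the function name `merge_site_lists` is hypothetical.

```python
def merge_site_lists(L, R, L_i, R_i):
    """Fold site i's L-list (L_i) and R-list (R_i) into the partial sets L and R.

    All arguments map lock identifier -> sequence number. L and R are updated
    in place and also returned, as they would be placed back in the NA message.
    """
    for lock, seq in L_i.items():
        if lock in R:
            if seq > R[lock]:      # the lock request is the latest
                L[lock] = seq
                del R[lock]
            # otherwise the release already in R is later: nothing to do
        else:
            L[lock] = seq
    for lock, seq in R_i.items():
        if lock in L:
            if seq > L[lock]:      # the release request is the latest
                R[lock] = seq
                del L[lock]
        else:
            R[lock] = seq
    return L, R


# Scenario from the text: the lock request (sequence 4) is still in the
# L-list of LC2, while its release request (sequence 7) is in the R-list of LC1.
L, R = {}, {}
merge_site_lists(L, R, {"x": 4}, {})   # contribution of LC2
merge_site_lists(L, R, {}, {"x": 7})   # contribution of LC1
```

In this scenario the release request carries the larger sequence number, so the lock ends up in R and not in L, whatever the order in which the two sites are visited.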


After the notification phase is over, the new LC will
send a message to every LLC asking them to update their LOCK
tables. This message is called an "UPDATE TABLE" or UT
message, and it carries within it the sets L and R. The
actual set R contained in this message also includes lock
release requests for transactions which are not local
anymore. These transactions are first aborted and their locks
released when the LOCK table is updated. Upon receipt of
the UT message, the L-list of each site will be made equal
to the set of locks in the set L which are relevant to the
site. Analogously, the R-list of each site will be made
equal to the set of locks in the set R which are relevant to
the site. After this is done, each site sends a "READY TO
UPDATE" or RU message to the new LC.

After receiving an RU from every up site the new LC
becomes the actual LC by notifying all the LLCs that they can
resume their normal activity. For this purpose the LC
broadcasts a "RESUME NORMAL ACTIVITY" or RNA message. The
new value for U is the set of sites from which the LC
received an RU message. This new value for U is included in
the RNA message, thus allowing every node in U to know the
composition of the set U. Also, upon receipt of the RNA
message, the LOCK tables are updated according to the
L-lists and R-lists.
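
The effect of the LOCK table update phase can be sketched as follows. This Python fragment is a simplified, single-process illustration and not the ACP protocol itself; it assumes that LOCK tables, L and R are sets of lock identifiers, that `relevant_sites[x]` plays the role of S(x), and that the hypothetical callback `responds(site)` tells whether an RU message was received from that site.

```python
def update_phase(tables, L, R, relevant_sites, responds):
    """Simulate the UT / RU / RNA exchange.

    tables: {site: set of locks in its LOCK table}
    Returns (new_U, new_tables) for the sites that answered with RU.
    """
    staged = {}
    for site in tables:
        if not responds(site):          # no RU received: site drops out of U
            continue
        # On UT: keep only the locks of L and R relevant to this site.
        l_list = {x for x in L if site in relevant_sites[x]}
        r_list = {x for x in R if site in relevant_sites[x]}
        staged[site] = (l_list, r_list)
    new_U = set(staged)                 # sites from which an RU was received
    # On RNA: install the L-list locks and delete the R-list locks.
    new_tables = {s: (tables[s] | l) - r for s, (l, r) in staged.items()}
    return new_U, new_tables
```

A site that never answers with RU is simply absent from the new up list and from the returned tables.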

Let  us  now  describe  how  we  can  guarantee  that  only  one 
LC  will  emerge  from  the  notification  process.  Recall  that 
the  nominator  will  nominate  the  first  up  node  in  the  nomina¬ 
tion  sequence.  Let  us  make  the  following  definition: 

DEFINITION 3.6 (trial sequence T[j,k]): A trial sequence,
T[j,k], is the sequence i[1], i[2], ..., i[k-1] of site
numbers to which an "ACCEPT NOMINATION" message has been
unsuccessfully sent by a nominator j, before j sent an AN
message to site #k.
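
The walk through the nomination order, and the trial sequence it produces, can be sketched in Python. This is a minimal sketch under assumptions not stated in the text: nodes are numbered 0 to n-1, the hypothetical callback `accepts_nomination(node)` stands in for sending one AN message and receiving its acknowledgment, and `max_retries` is an illustrative retransmission limit.

```python
def nominate(crashed_lc, n, accepts_nomination, max_retries=3):
    """Nominate the next up node after `crashed_lc` in the nomination order.

    Returns (nominee, trial_sequence); nominee is None if no node answered.
    trial_sequence lists the nodes to which AN was sent unsuccessfully.
    """
    trial_sequence = []
    for step in range(1, n):
        candidate = (crashed_lc + step) % n
        for _ in range(max_retries):
            if accepts_nomination(candidate):   # AN acknowledged
                return candidate, trial_sequence
        # No acknowledgment after all retransmissions: assume it is down.
        trial_sequence.append(candidate)
    return None, trial_sequence
```

For example, if node 0 crashed in a four-node network where only nodes 2 and 3 answer, the call returns nominee 2 with trial sequence [1].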

For every AN message sent from site #j to site #k we
include the sequence T[j,k] as part of it. This sequence
will also be included as part of the "NOMINATION ACCEPTED"
message which circulates through the set of sites. The
purpose of this is to allow any site to resolve any conflict
that can arise due to the race conditions discussed earlier
in the chapter. Namely, it is possible that more than one
LC was nominated and consequently more than one NA message
(from distinct sources) would be circulating. Conflicts are
resolved by giving preference to the last LC to be
nominated. NA messages originated by other nominated LCs are
killed when they are detected to belong to the improper LC.
Therefore, each time an NA message is received by node i the
algorithm, shown below in Algol-like notation, is executed.

Algorithm: Process an NA message at site i

Assume that in each site there is a trial list T, which
is empty if no NA message has already been received in the
present recovery phase and that will be updated by the
algorithm below (as executed at site i). Let the variable
NEWLC(i) be the identification of the new LC-site as
imagined by node i. Let T[j,k] be the trial sequence included
in the NA message.

IF T is empty or k ∈ T
THEN BEGIN
   COMMENT either no NA message has already been received
           for this recovery process or the current NA
           message belongs to a more recently nominated LC;
   T := T[j,k]; NEWLC(i) := k;
END;
ELSE Send "KILL NA" message to site k informing it that its
     cycle was aborted;

In many instances in this protocol we require a certain
message to circulate through a set of nodes, as is the
case of the NA message. Let us call such messages 'circular
messages'. They always have a source or generator who
is responsible for sending it through a cycle. The
underlying network protocols assure us that messages will not
get lost while going from one site to another by the use of
time-out and retransmission schemes. However, a circular
message can still be lost if a node in the cycle crashes
after receiving it but before being able to forward it. The
loss of a circular message can be prevented by having each
node in the cycle send to the circular message generator an
acknowledgment for it, but only after it was forwarded to
the next node in the sequence. Now, the source is able to
detect a cycle interruption and it can appropriately resume
it by sending the last copy of the message to the
appropriate site*. This source acknowledgment scheme at the
Centralized Lock Controller protocol level will be assumed to
exist whenever a circular message is necessary.
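
The source acknowledgment scheme can be illustrated by a small sequential simulation in Python. It is only a sketch under assumptions that go beyond the text: the acknowledgment sent to the generator is assumed to carry the copy of the message that was forwarded, a crash is modelled by listing the nodes that fail after receiving the message but before forwarding it, and the names `circulate` and `handle` are hypothetical.

```python
def circulate(cycle, message, handle, crashes_on_receipt=frozenset()):
    """Simulate one pass of a circular message with source acknowledgments.

    cycle: ordered list of node identifiers
    handle(node, message) -> message: the contribution made by a node
    crashes_on_receipt: nodes that crash before forwarding the message
    Returns (final_message, acked_nodes, skipped_nodes).
    """
    last_copy = message            # last version known to the generator
    acked, skipped = [], []
    for node in cycle:
        if node in crashes_on_receipt:
            # No acknowledgment arrives: the generator detects the
            # interruption and resends its last copy to the next node.
            skipped.append(node)
            continue
        # The node processes the message, forwards it, and only then
        # acknowledges to the generator.
        last_copy = handle(node, last_copy)
        acked.append(node)
    return last_copy, acked, skipped
```

For example, with cycle [1, 2, 3, 4], node 3 crashing on receipt, and each node appending its own identifier, the message returns to the generator as [1, 2, 4], without any contribution from node 3.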


It should be noted that if an application program
issues a lock or release request and the LC fails before the
request is present at every site, the request will never
appear in the local LOCK table even after the LCR is
completed. Therefore, APs should time out for requests and
resubmit them.


*  This  procedure  can  be  optimized  to  a  desired  level  by 
having  only  every  i-th  node  send  an  acknowledgment  back  to 
the  generator  of  the  circular  message.  Then  there  is 
additional  uncertainty  over  which  node  in  the  cycle  failed. 
It  is  up  to  the  generator  to  resolve  the  uncertainty. 
The LCR mechanism can be described in an abstract form
by the STDs in figure 3.5. Figure 3.5.a is the STD for the
newLC and figure 3.5.b is the STD for any local lock
controller participating in LCR. Let LC be the identification
of the oldLC and let LT be its LOCK table. Let LC* be the
newLC and LT* be its LOCK table. Associated with each state
in the STDs of figure 3.5 is a 3-tuple of the form (a,b,c)
where a, b and c are the LC identification, the newLC
identification and the LOCK table, respectively.

3.2.3.2 - Proofs About LCR

We would like to prove now that the notification phase
ends with one and only one LC having been successfully
nominated, and that all sites know the correct new LC
identification. As a first step we state assertions 3.3 and
3.4, which are concerned with the behavior of LCR given that
no additional crashes occur.

ASSERTION 3.3: Given that no additional crashes occur during
LCR, there will be one and only one LC whose identification
is known to all sites in the component at the end of the
notification phase.

Next, let a globally accepted lock (release) request be
one which is in all L-lists (R-lists) of a logical
component.

ASSERTION 3.4 (maximum forward progress - no additional
crashes): Given that no additional crash occurs, the
following is true at the end of the LCR mechanism. A lock p
is in the LOCK table of all sites of S(p) if the lock would
have been in the LOCK table of the crashed LC if the crash
had not occurred and if no new requests were submitted.
Otherwise the lock is in none of the LOCK tables at the sites
of S(p).

Given  these  assertions  we  prove  the  robustness  of  the 
LCR  mechanism. 

THEOREM  3.2:  The  Logical  Component  Recovery  (LCR)  algorithm 
is  robust. 


Proof: The completion point for this algorithm occurs
when the newLC has logged the fact that it is the LC
after at least one RNA message has been sent. This point
is indicated as state s5 in the STD of figure 3.5.a. The
only terminal crash is a newLC failure before this point.
This crash when detected will cause another LC to be
nominated and the LCR mechanism to be restarted. In order
to prove that LCR is robust we must prove that internal
consistency is not violated by the partial execution of
LCR. In other words we need to show that the set of
LOCK tables of the component is LT-consistent. The crash
of the newLC can occur:

i) before any LOCK table has been updated.

ii) after some but not all LOCK tables have been updated.

iii) after all LOCK tables have been updated.

In case i) it is clear that the partially executed
LCR has no effect at all. In case iii) all LOCK tables
will be identical, therefore internal consistency for the
component in question is trivially satisfied. In case
ii) the crash occurred in the middle of the LOCK table
update phase. But since LOCK tables are updated using
the ACP protocol, the state of the LOCK tables, L-lists
and R-lists at the instant of the crash is a state which
could have resulted if the crash had occurred during
normal operation. Therefore, the partial execution of the
LCR mechanism does not violate the internal consistency
of the component. The LCR mechanism will be restarted
again as many times as crashes occur.

Now, it remains for us to analyze the transparent
failures. Those are all the failures other than the
newLC crash already discussed. We can have either a
process or processor failure which simply knocks out one of
the sites in the component, or the component can be
partitioned into two or more components. In either case, a
set of one or more nodes is isolated from the set of
nodes which participate in the LCR mechanism. The nodes
in this set will not be considered any more for the rest
of the LCR algorithm. However, we have to show that no
inconsistencies are generated by a node dropping out
during the execution of LCR.


For this purpose, we will examine all the possible
instants at which a node j may crash.

CASE 1: during the 'notification phase'

Here we have to show that the sets L and R will not
be perturbed by any contributions already made to them by
node j. Node j can crash at three possible instants.

CASE 1.1: before the NA message first reaches it.

In this case node j is simply removed from the up
list U without contributing to the formation of either L
or R.

CASE 1.2: after the NA message reaches it and before it
is forwarded to the next node in the sequence.

Here, the node which sent the NA message to node j
will time out, detect its crash and send the NA message to
the node which follows node j in the sequence. Again no
contributions have been made to the sets L or R.

CASE 1.3: after the NA message has been forwarded

A crash of node j at this point is equivalent to a
crash of a node during the 'LOCK table update' phase
since node j already played its role in the 'notification
phase'. Therefore, this case reduces to the next one to
be examined. The reader should notice that the robustness
of this recovery mechanism relies heavily on the
fact that the ACP protocol is used in the LOCK table
update phase.

CASE 2: during the 'LOCK table update phase'

A crash of a node during this phase will have no
effect upon other nodes, resulting only in the removal of
this node from the up list of the logical component which
is recovering.

Examination of all these cases completes this
proof. []


The above result allows us to relax the assumption made
in assertion 3.4 that no additional crashes occur during LCR
and state the following assertion.

ASSERTION 3.5 (maximum forward progress - additional
failures allowed): Let C be a logical component. Let U be
the up-list which defines the component C. Let U' be a
subset of U which indicates the nodes which are actually up
at the end of the LCR mechanism*. The following is true at
the end of the LCR mechanism. A lock p is in the LOCK table
of all sites of S(p) ∩ U' if the lock would have been in the
LOCK table of the crashed LC if the crash had not occurred
and if no new requests were submitted. Otherwise, the lock
is in none of the LOCK tables of the sites of S(p) ∩ U'.

* Note that U ≠ U' if any node crashes before the end of LCR
but after all the TU messages have been received.

Finally  we  prove  that  every  logical  component  is  inter¬ 
nally  consistent. 

THEOREM 3.3: Every logical component is internally
consistent.

Proof: Let C be any logical component. We have to prove
that:

i) the set of LOCK tables of C is LT-consistent

ii) there is one and only one LC for C.

Statement i) is clearly true for normal operation of
component C since assertions 3.1 and 3.2 were
demonstrated for this case. Now, by assertion 3.5 either a
lock p is in all the LOCK tables of S(p) or it is in none of
them at the end of LCR. Therefore, assertions 3.1 and
3.2 are trivially satisfied and consequently
LT-consistency is preserved.

Statement ii) was proved to be correct in assertion
3.3 for the case in which no additional crashes occur
during LCR. But, by theorem 3.2, LCR is robust. This
allows us to consider the effect of LCR as if no
additional crashes occur during its execution, and concludes
the proof. []


3.2.3.3 - Single Node Recovery


So far we have described how the system recovers from a
logical component crash. We show now how a node which is
down becomes active again, or in other words, how it gets
logically connected to a logical component. Let node j be
such a node. The first step to become active is to find
out the identity of any LC. This step is carried out by
sending the "WHO IS THE LC?" or WLC message to any up node.
Assume first that node j received at least one reply to its
WLC message containing the identity of an LC. In this case,
node j sends a message called "HI THERE" or HT to the LC
telling it that node j is alive again. If the LC is not
undergoing any kind of crash recovery it will send to node
j the portion of its LOCK table relevant to node j as well
as its up list. An "ACCEPT LOCK" or "ACCEPT RELEASE"
message is sent to node j by the LC for every lock or release
lock request for which not all the LA or RA messages have
been received.

Assume now that node j does not get any answer or that
all the nodes which replied to its WLC message are
themselves recovering from a crash or coming up from normal
system shutdown. Then, node j becomes a logical component on
its own. Node j is the LC for this logical component and
the LOCK table, L-list and R-list are initialized as empty.
The Logical Component Merge procedure described in section
3.2.3.5 will take care of integrating node j into another
logical component.
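
The decision taken by a recovering node can be summarized by the following Python sketch. It is a simplified illustration with assumed names: `replies` is a hypothetical list of (responder, LC identity) pairs collected for the WLC message, where an identity of None stands for a responder that is itself recovering or starting up, and the returned dictionaries merely describe the next action.

```python
def single_node_recovery(node, replies):
    """Decide how recovering node `node` rejoins the system.

    replies: list of (responder, lc_id_or_None) answers to the WLC message.
    Returns a description of the action the node takes next.
    """
    known_lcs = [lc for _, lc in replies if lc is not None]
    if known_lcs:
        # At least one reply names an LC: announce ourselves with HT.
        return {"action": "send HT", "to": known_lcs[0]}
    # No usable answer: the node becomes a logical component on its own,
    # acting as its LC with empty LOCK table, L-list and R-list.
    return {
        "action": "become LC",
        "lc": node,
        "lock_table": set(),
        "l_list": set(),
        "r_list": set(),
    }
```

In the second case the node is left as a one-site component, to be integrated later by the Logical Component Merge procedure.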

- Robustness of SNR

THEOREM 3.4: The Single Node Recovery (SNR) algorithm is
robust.

Proof: The only case of interest is the one in which the
recovering node is able to get the identity of an LC as a
response to its WLC message since otherwise the SNR
mechanism degenerates into an LCM as already explained
above. Let j be the recovering node and let LCi be the
LC to which node j is trying to connect. The proof
is extremely simple since the only two crashes of
interest are: a) LLCj crash and b) LCi crash. Case a) is
clearly a terminal case. Case b) is also a terminal
crash since a crash of LCi, before it is able to send
the LOCK table to LLCj, prevents the LOCK table from
being received by node j, which implies that SNR has
to be restarted. This completes the proof. []


As a result of the Logical Component Recovery algorithm,
an LC will be elected in each logical component of the
network. Transactions which are local to a component will
continue to be serviced as if no disconnecting crash had
occurred. On the other hand, transactions which span more
than one component will have to wait until the components
involved are brought together again. It is the
responsibility of each LC to detect when two components are
physically connected again and to take the necessary steps
to merge them into one logical component. The merge of
logical components will always be done on a pairwise basis.
The whole Logical Component Merge mechanism is divided into
two phases, namely a 'reconnection detection' phase and a
'merge' phase.


In the 'reconnection detection' phase, each LC
periodically sends a "WERE YOU ALIVE" or WYA message to
every node not in its up list. The purpose of this message
is to detect the existence of sites which were not reachable
before but which were up. For the purposes of the
description that follows, let the two logical components to
be merged be called C1 and C2. Let LC1 and LC2 be their
respective LCs and U1 and U2 their respective up lists. LC1
will take an active role during the whole recovery phase,
while LC2 will take a passive one. As we will see, a crash
of LC1 while the recovery mechanism is in progress will
result in abort, while a crash of LC2 after the
'reconnection detection' phase is tolerated. Assume now
that site #j in C2 received a WYA message from LC1. A
component is said to be in NORMAL status if it is not
undergoing any kind of crash recovery mechanism. If
component C2 is in its NORMAL status, site #j sends a
"YES I WAS" or YIW message to LC1. This message carries
within it the identification of LC2.

At this point LC1 has to establish a logical connection
with LC2. This connection is called a primary-secondary or
P-S connection, with LC1 being the primary and LC2 the
secondary. Since we require that LCM be done on a pairwise
basis, the following conditions must be enforced by the
protocol that establishes a P-S connection:

C1: an LC cannot be primary (secondary) for more than one
P-S connection.

C2: an LC cannot be primary and secondary simultaneously.

The P-S connection is attempted by having LC1 send a
"LET US MERGE" or LUM message to LC2. The status of LC1 is
now changed to ATTEMPT. If the status of LC2 is NORMAL,
which means that neither Logical Component Merge nor Logical
Component Recovery is being attempted, LC2 sends a "MERGE
ACCEPTED" or MA message to LC1 and changes its internal
state to SECONDARY. Upon receipt of the MA message the
connection is considered to be successfully established by
LC1. If the status of LC2 is not NORMAL then a "MERGE
ATTEMPT REJECTED" or MAR message is sent to LC1, which will
either retry later or try a connection with another LC.

The above interconnection strategy could clearly allow
undesirable race conditions to occur, such as having two LCs
trying to play the role of primary, leading the system into
deadlock situations. To avoid this problem, we assign a
site dependent priority to each LC (no two sites have the
same priority). LUM messages from lower priority LCs are
rejected. LUM messages from higher priority LCs, if
received while the connection has not yet been completed,
i.e. the MA message has not been received, cause the
connection being attempted to be broken. To this end the
primary sends a "CLOSE CONNECTION" or CC message to its
intended secondary.
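
The connection protocol can be sketched as a small state machine. The sketch below is an interpretation of the text, not a transcription of the state diagram: it assumes that the priority comparison is applied only while an LC is in the ATTEMPT state, that a NORMAL LC accepts any LUM, and that MAR and CC return their receiver to NORMAL. The class `LC`, the helper `run` and the tuple message format are hypothetical.

```python
from collections import deque

NORMAL, ATTEMPT, SECONDARY = "NORMAL", "ATTEMPT", "SECONDARY"
ESTABLISHED = "CONNECTION ESTABLISHED"


class LC:
    def __init__(self, name, priority):
        self.name, self.priority = name, priority
        self.status = NORMAL
        self.peer = None  # LC we are attempting to reach or are connected to

    def attempt(self, target):
        """Start a P-S connection attempt; returns the messages to send."""
        if self.status != NORMAL:
            return []
        self.status, self.peer = ATTEMPT, target
        return [(target, "LUM", self)]

    def receive(self, msg, sender):
        """Process one message; returns the messages sent in response."""
        out = []
        if msg == "LUM":
            if self.status == NORMAL:
                self.status, self.peer = SECONDARY, sender
                out.append((sender, "MA", self))
            elif self.status == ATTEMPT and sender.priority > self.priority:
                # Break our own attempt (CC), then accept the higher
                # priority LC as primary (MA).
                out.append((self.peer, "CC", self))
                self.status, self.peer = SECONDARY, sender
                out.append((sender, "MA", self))
            else:
                out.append((sender, "MAR", self))
        elif msg == "MA" and self.status == ATTEMPT and sender is self.peer:
            self.status = ESTABLISHED
        elif msg == "MAR" and self.status == ATTEMPT and sender is self.peer:
            self.status, self.peer = NORMAL, None
        elif msg == "CC" and self.status == SECONDARY and sender is self.peer:
            self.status, self.peer = NORMAL, None
        return out


def run(messages):
    """Deliver messages in FIFO order until none remain."""
    queue = deque(messages)
    while queue:
        dest, msg, sender = queue.popleft()
        queue.extend(dest.receive(msg, sender))
```

When two LCs attempt a connection to each other at the same time, the higher priority one ends up as primary and the other as secondary, which is the race the priority rule is meant to resolve.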

That the protocol outlined above satisfies conditions
C1 and C2 is proved in section 3.2.4.1. Figure 3.6 shows a
state transition diagram describing the interconnection
protocol. This protocol is the same for every node. Node
labels are STATUSes, while arc labels are of the form R/T
where R is the message whose arrival triggers the
transition and T is a sequence of actions (transmission of
messages) which occur as a consequence of the transition.

Figure 3.6 - STATE TRANSITION DIAGRAM FOR P-S
CONNECTION ESTABLISHMENT. A plus (+) sign indicates
reception of a message and a minus (-) sign indicates
transmission of a message. The sign < indicates that the
message in question originates from a lower priority
source, while > indicates a higher priority site. The
dollar ($) sign indicates that no action is taken due to
a state transition.




AD-A186 683  SECURE DISTRIBUTED PROCESSING SYSTEMS(U)  CALIFORNIA
UNIV LOS ANGELES SCHOOL OF ENGINEERING AND APPLIED
SCIENCE  G J POPEK  DEC 78  UCLA-ENG-7995
UNCLASSIFIED  MDA903-77-C-0211  F/G 12/7  NL



After a P-S connection has been established between LC1
and LC2, they will not accept any more new lock or lock
release requests from nodes in their components and will
complete all outstanding ones. An outstanding request is
one for which all AL or AR messages have already been sent
but not all the corresponding LA or RA messages have been
received. After all outstanding requests have been
completed by LC2, it sends to LC1 a "READY TO MERGE" or RTM
message containing as arguments the up list U2 and the LOCK
table at LC2, which is now the same for all nodes in C2.
The receipt of the RTM message by LC1 marks the end of the
'reconnection detection' phase.

The 'merge' phase will construct the union of the LOCK
tables at both components. Notice that up to this point no
permanent change has been made to any LOCK table or up
list of any node. LC1 sends a "SUBSTITUTE YOUR TABLE" or
SYT message for a cycle through the set of nodes in
TEMP_U = U1 ∪ U2. The SYT message is the agent which
confirms the merge of the two components by carrying within
it the new LOCK table for the component. Also, the up lists
are updated and LC1 becomes the new LC of the new logical
component.
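
As an illustration, the data carried by the SYT message can be built as follows. This is a minimal sketch under assumptions of ours: a LOCK table is modeled as a dictionary from a lockable item to the transaction holding it, and the function name is hypothetical. The conflict check is only a sanity check, since locks granted in distinct components are not expected to conflict.

```python
def merge_components(u1, lt1, u2, lt2):
    """Return (TEMP_U, merged LOCK table) as LC1 would place them in SYT."""
    temp_u = set(u1) | set(u2)  # TEMP_U = U1 union U2
    merged = dict(lt1)          # start from the LOCK table of C1
    for item, holder in lt2.items():
        if item in merged and merged[item] != holder:
            raise ValueError(f"conflicting locks on {item!r}")
        merged[item] = holder
    return temp_u, merged
```

Nothing is changed at any node until the SYT message is received, so an abort before that point leaves both components as they were.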

THEOREM 3.5: The Logical Component Merge (LCM) algorithm is
robust.

Proof: The completion point for the LCM algorithm is the
point where the SYT message has already been received and
accepted by one LLC.

Let LT(i), U(i) and LC(i) be respectively the LOCK table
at site i, the up list at site i and the LC identification
as known by site i. It is worth observing that changes to
the values of LT(i), U(i) and LC(i) at any site i other
than the LC1 site are only made upon receipt of the SYT
message.

Let  us  examine  the  possible  cases  of  crashes  before 
the  completion  point: 

CASE  1:  crashes  during  the  'reconnection  detection'  phase 

A crash of either LC1 or LC2 in this phase will
cause LCM to be aborted and an LCR to be started at the
component which suffered the LC crash. Since no LOCK table
or up list has been changed so far, this is a terminal
crash. Since LC1 and LC2 are the only processes involved
in this phase, we conclude that this phase is robust.

CASE  2:  crashes  during  the  'merge'  phase 


A crash of LC1 during this phase will interrupt LCM
and start LCR for component C1. As no permanent changes
have been made yet, this is a terminal crash. A crash of
any other node (including LC2) clearly does not affect any
other node or the mutual consistency of the merged logical
component. []

3.2.4 - Disjointness of the Recovery Algorithms

We show here that there is no interaction between the
three recovery algorithms. To that effect one has to show
that:

a) LCM is done pairwise

b) LCR, LCM and SNR are mutually exclusive.

To verify condition a) we only need to show that
conditions C1 and C2 stated in section 3.2.3.5 are satisfied
by the P-S connection protocol. This verification is done
in section 3.2.4.1. Condition b) is shown to hold in
section 3.2.4.2.

3.2.4.1 - Disjointness of LCMs

Consider a directed graph G whose vertex-set is the set
of LCs and which has two distinct types of arcs, namely
e-arcs and a-arcs. There is an e-arc from vertex i to
vertex j if there is an established P-S connection between
vertices i and j, vertex i being the primary. Equivalently,
an e-arc from vertex i to vertex j is said to be created in
G whenever vertex i enters the CONNECTION ESTABLISHED state
(see Figure 3.6). There is an a-arc from vertex i to vertex
j if vertex i is attempting a P-S connection to vertex j.
Such an a-arc is created as soon as vertex i enters the
ATTEMPT state (see Figure 3.6). The graph G displays the
pattern of established and attempted connections. Let e-G
be the subgraph obtained from G by considering only the
e-arcs of G and a-G be the one obtained by taking only the
a-arcs.

Conditions C1 and C2 can now be rephrased as follows:

C1.1: 0 <= indegree(v) <= 1 and 0 <= outdegree(v) <= 1 for
all v in e-G.

C2.1: indegree(v) * outdegree(v) = 0 for all v in e-G.
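
The two degree conditions are easy to check mechanically on a given set of e-arcs. The following sketch is only an illustration; the function name and the representation of e-G as a list of (primary, secondary) pairs are our own assumptions.

```python
from collections import Counter


def satisfies_c11_c21(e_arcs):
    """Check C1.1 and C2.1 on e-G, given as (primary, secondary) pairs."""
    arcs = list(e_arcs)
    outdeg = Counter(p for p, _ in arcs)  # connections in which v is primary
    indeg = Counter(s for _, s in arcs)   # connections in which v is secondary
    vertices = set(outdeg) | set(indeg)
    c11 = all(indeg[v] <= 1 and outdeg[v] <= 1 for v in vertices)
    c21 = all(indeg[v] * outdeg[v] == 0 for v in vertices)
    return c11 and c21
```

A graph satisfying both conditions consists of isolated vertices and disjoint primary-secondary pairs, which is exactly the pairwise merge requirement.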


Every a-arc will either be deleted from G when the
attempted connection is broken or will become an e-arc if
the connection is successfully established. So, we want to
prove the following:


THEOREM 3.6: Given a graph G whose e-graph satisfies
conditions C1.1 and C2.1, the new e-graph obtained from G as
new connections are established also satisfies those
conditions.

Proof: It can easily be seen, from the protocol
specification, that condition C1.1 is satisfied not only by
the initial e-graph but also by the graph G, since:

a) if there is already a connection between vertices
i and j or one is being attempted, no new connection
is attempted by either vertex i or vertex j.

b) if a connection has already been established or is
being attempted, the secondary will reject all
further attempts.

So,  it  remains  for  us  to  examine  all  the  possible  cases 
in  which  condition  C2.1  could  conceivably  be  violated  in  G 
and  show  that  the  resulting  e-graph  obtained  when  one  or 
more  a-arcs  become  e-arcs  still  satisfies  this  condition. 
There are four possible cases, two of which can never happen
due to the protocol specification, while the remaining two
have to be examined. Given any three vertices a, b and c,
the four possible cases are:

a)  (a,b)  and  (b,c)  are  e-arcs. 

b)  (a,b)  is  an  e-arc  and  (b,c)  is  an  a-arc. 

c)  (a,b)  is  an  a-arc  and  (b,c)  is  an  e-arc. 

d)  (a,b)  and  (b,c)  are  a-arcs. 


Cases a) and b) are the impossible ones. In case c)
the attempted connection between a and b will fail since
there is an established connection from b to c (see the self
loop at the CONNECTION ESTABLISHED state of the diagram of
Figure 3.6). Therefore, arc (a,b) will disappear. In case
d) nodes a and b are in the ATTEMPT state. If (a,b) becomes
an e-arc we can see that the transition labeled LUM/CC;MA
from state ATTEMPT to state SECONDARY is taken at vertex b,
causing the attempted connection (b,c) to be broken.
Therefore arc (a,b) becomes an e-arc while arc (b,c)
disappears. On the other hand, if (b,c) becomes an e-arc
first, we are back to case c), which was already
examined. []


We take the opportunity here to prove that the P-S
connection protocol is such that all the a-arcs in G will,
in a finite time (of the order of magnitude of the
transmission delay time in the network), either disappear
or become e-arcs. In other words, the P-S connection
protocol is deadlock free.


THEOREM 3.7: The P-S connection protocol is deadlock free.

Proof: We must prove that there can be no long lasting
cycles in G. The interesting case is, of course, that of
cycles made up only of a-arcs since, as shown in the
previous theorem, any a-arc adjacent to an e-arc will
disappear in a finite time.


Consider a cycle in a-G and two adjacent a-arcs (a,b)
and (b,c) in the cycle. Vertices a and b are in the ATTEMPT
state. There are only two possible cases to consider:

CASE 1: [PRIORITY(a) > PRIORITY(b)]: In this case, if the
"MERGE ACCEPTED" message from vertex c is received by b
before the "LET US MERGE" message from a, then (b,c) becomes
an e-arc and (a,b) disappears. If instead the "LET US
MERGE" message from a arrives first, b takes the LUM/CC;MA
transition examined in the previous theorem, so (b,c)
disappears.

CASE 2: [PRIORITY(a) < PRIORITY(b)]: Here, arc (a,b) will
disappear since a has lower priority than b.

In any event, the cycle will eventually be broken.
Note also that vertex c could be the same as a and the
above analysis is still valid. []

3.2.4.2 - Disjointness of LCR, LCM and SNR

We first define a node state transition diagram as a
directed graph whose vertices are states of a network node
and whose arcs represent transitions between states. The
state of a node i is the 3-tuple [STATUS(i), LC(i), U(i)],
where LC(i) and U(i) are as defined before. STATUS(i) is
the status of the component to which site i is attached as
viewed by site i. NORMAL status indicates that neither LCR
nor LCM is in progress; RECOVERY means that LCR is taking
place and QUIESCENT indicates that LC(i) is rejecting
further requests. The labels on the arcs specify the
conditions upon which a transition between two states
occurs. These conditions can either be a crash detection
or a message arrival. The diagram, shown in Figure 3.7,
shows all possible state transitions for a node, other than
LC1, which is in a component C1, with LC equal to LC1 and up
list equal to U1. From every state there is a transition to
the DOWN state. These transitions are not represented in
the diagram for obvious reasons.

Figure 3.7 - NODE STATE TRANSITION DIAGRAM. The
following relationships are observed:

-  Uj = U1 ∪ U2 where LC2 is in U2.

-  Ui ⊆ U1.

-  LCi is in Ui.


The state [NORMAL, LCj, Uj] is a state which resulted
from a successful merge of component C1 with another
component, for instance C2. The state [NORMAL, LCi, Ui] is
a state which resulted from a successful Logical Component
Recovery.


By  inspection  of  the  diagram,  we  observe  that  a  node 
can  only  go  from  one  normal  state  to  a  different  normal 
state  after  one  and  only  one  recovery  mechanism  has  been 
completed.  Therefore,  there  is  no  interaction  among  the 
three  recovery  mechanisms. 

-  Logical Components Mutual Consistency

Let us show here that the CLC protocol (including the
recovery mechanisms) is such that the set of Logical
Components into which the network is partitioned is
mutually consistent.

THEOREM 3.8: The set of logical components into which the
network is partitioned is mutually consistent.

Proof: By theorem 3.3 each one of the logical components
is internally consistent. It remains for us to prove
that there can be no lock present in any LOCK table of
any component which conflicts with another such lock of
any other component. This theorem is trivially true when
there is only one logical component. Further net
partitioning does not destroy this property since locks are
only granted if they are local to a component, which
implies that they do not conflict with any other lock
granted at any other component. []


1.2..L*  -  Database 


We show here that given a deadlock free, consistency
preserving locking mechanism for a centralized database
(CDB), the CLC protocol can be used to implement an
equivalent robust, deadlock free, consistency preserving
locking mechanism for a distributed database (DDB). A
database is said to be in a consistent state if all the data
items satisfy a set of assertions or consistency
constraints. A transaction is a sequence of accesses which
takes the database from a consistent state into another
consistent state. Thus, a transaction is the unit of
consistency. Let us define an access as the pair (P,a)
where P is a logical description of the portion of the
database to be accessed and a is an access mode (e.g. read,
write, delete, etc.). If all the locks are granted by a
process which has complete knowledge of all other active
locks (as is the case with the LC) and if every access is
checked against the LC copy of the LOCK table (this
condition will be relaxed later) to see whether the
transaction holds the necessary locks, then the 'lock
scheduler' for a CDB described by Eswaran [1] can be
implemented in a straightforward manner with the use of the
CLC protocol. Such a locking mechanism has the properties
of being robust and preserving the consistency of the DB.
Notice that deadlock prevention or detection mechanisms can
be carried out by the LC since it has complete control over
all activities in its component. Recall that if the network
is partitioned into more than one component, locks granted
in one of them do not conflict with locks active in others.
Therefore, distinct LCs manage disjoint sets of "resources",
where a resource here means an individually lockable data
item in the DB. So, a deadlock prevention or detection
policy can be implemented in each LC independently of all
the others.


The requirement that every access be checked against
the LOCK table at the LC-site can be relaxed in favor of
having the access checking done locally. In order for this
to be possible, a lock must be considered to be active at a
given site i for a time interval T2 contained in the time
interval T1 during which the lock is active at the LC-site;
otherwise some portions of the DB could be locked in
conflicting modes for different transactions. Figure 3.8
shows a double time axis diagram displaying time at the
LC-site and at a given site i where a lock request is
originated. T1 starts when the "CONFIRM LOCK" message is
sent to every site in the component and ends with the
broadcast of the "CONFIRM RELEASE" message. T2 starts with
the arrival of a CL message at site i. Although a lock is
only removed from a LOCK table when the corresponding
"CONFIRM RELEASE" message arrives, it can be flagged as
'waiting for removal' as soon as a "RELEASE LOCK" message is
sent from the LC to site i. For access checking purposes,
all flagged locks must be considered as non active. The
extra precaution that must be taken in this case is to
unflag all flagged locks after LCR has taken place.
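
The local checking rule can be illustrated with a small table that keeps a 'waiting for removal' flag per lock. This is a simplified sketch under assumptions of ours: one lock per data item, no lock modes, and hypothetical class and method names.

```python
from dataclasses import dataclass


@dataclass
class LockEntry:
    holder: str            # transaction holding the lock
    flagged: bool = False  # True once the lock is 'waiting for removal'


class LocalLockTable:
    """Copy of the LOCK table kept at a site i for local access checking."""

    def __init__(self):
        self.locks = {}  # data item -> LockEntry

    def on_confirm_lock(self, item, holder):
        # Arrival of CL: the lock becomes active at this site (T2 starts).
        self.locks[item] = LockEntry(holder)

    def on_release_lock(self, item):
        # The release has started: flag the lock but do not remove it yet.
        if item in self.locks:
            self.locks[item].flagged = True

    def on_confirm_release(self, item):
        # Arrival of CR: the lock is removed from the table.
        self.locks.pop(item, None)

    def after_lcr(self):
        # Extra precaution: unflag all flagged locks after LCR.
        for entry in self.locks.values():
            entry.flagged = False

    def may_access(self, item, holder):
        # Flagged locks count as non active for access checking.
        entry = self.locks.get(item)
        return entry is not None and not entry.flagged and entry.holder == holder
```

Treating flagged locks as non active makes the local interval T2 end no later than T1 does at the LC-site, which is the containment the text requires.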


-  Cost and Delay


An  analysis  of  the  cost  and  delay  associated  with  the 
CLC  protocol  is  presented  in  this  section.  The  types  of 
costs  we  can  consider  are: 


-  communication  cost:  this  is  the  cost  associated  with 
the  exchange  of  messages  required  by  the  protocol. 


-  processing  cost 


-  storage  overhead  cost 


Figure 3.8 - T1 is the time interval during
which a lock is considered to be active at
the site where the LC is located. T2 is the
time during which the same lock is considered
to be active at the requesting site #i.


-  Delay Analysis


This section considers the delay, introduced by the CLC
protocol, for a lock request to be granted, DL, and the
delay, DR, for a lock to be released. The average delay,
Dupdt, for an update request to be completed is also
calculated for the update model described here.

Let,

TMAX  = average maximum delay introduced by the network.

T     = average message delay introduced by the network.

W     = average waiting time for a lock request to be
        granted at the LC.

Dupdt = average delay for an update given that no crash
        occurs during the whole period.

DL    = average delay for a lock request to be granted
        under no crash conditions.

DR    = average delay for a lock to be released under no
        crash conditions.

n     = number of sites.

m     = number of sites participating in a transaction.

We will use the notation X -> Y : M to denote the fact
that process X sends message M to process Y. The processes
involved are the AP (application programs), the LC (lock
controller) and the set of LLCi's (local lock controllers).


Let  us  first  calculate  DL. 

DL =  T     + (AP -> LC : LR)
      W     + (delay due to conflicts in LT)
      TMAX  + (LC -> LLCi : AL)
      TMAX  + (LLCi -> LC : LA)
      T       (LC -> AP : LG)

   =  2*(T + TMAX) + W                              (3.1)

We have neglected in the above calculations the
processing time at the several sites. Also, the exchange of
messages between the LC and every other LLCi takes place in
parallel. This gives rise to the term TMAX in (3.1).

Let  us  now  calculate  DR. 

DR =  T     + (AP -> LC : RL)
      TMAX  + (LC -> LLCi : AR)
      TMAX  + (LLCi -> LC : RA)
      T       (LC -> AP : RG)

   =  2*(T + TMAX)                                  (3.2)


In order to evaluate Dupdt we will use a model for
update in which the lock and release requests and their
corresponding messages travel together with the update
request in a single physical message. The following
calculation shows explicitly the exchange of messages
involved here.


Dupdt =  T       + (AP -> LC : LR+update request+RL)
         W+TMAX  + (LC -> LLCi : AL+AR)
         TMAX    + (LLCi -> LC : LA+RA)
         TMAX    + (LC -> LLCi : CL+do update+CR)
         T         (LC -> AP : update done)

      =  2*T + 3*TMAX + W                           (3.3)
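
The three delay expressions can be checked by summing the per-message terms listed above. The sketch below does this for made-up values of T, TMAX and W; the function names and the numbers are illustrative only and are not measurements.

```python
def lock_delay(T, TMAX, W):
    """DL: LR to the LC, wait W, AL broadcast, LA replies, LG to the AP."""
    return T + W + TMAX + TMAX + T


def release_delay(T, TMAX):
    """DR: RL to the LC, AR broadcast, RA replies, RG to the AP."""
    return T + TMAX + TMAX + T


def update_delay(T, TMAX, W):
    """Dupdt: combined request, AL+AR, LA+RA, CL+do update+CR, update done."""
    return T + (W + TMAX) + TMAX + TMAX + T


# Illustrative values in arbitrary time units (assumed, not from the report).
T, TMAX, W = 10, 25, 5
assert lock_delay(T, TMAX, W) == 2 * (T + TMAX) + W     # (3.1)
assert release_delay(T, TMAX) == 2 * (T + TMAX)         # (3.2)
assert update_delay(T, TMAX, W) == 2 * T + 3 * TMAX + W  # (3.3)
```

With these values DL is 75, DR is 70 and Dupdt is 100, so grouping the lock, update and release into single physical messages is cheaper than a separate lock request followed by a separate release (145 in total, before the update itself).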


Note that although some previously defined messages are
now grouped together into a single physical message, as
indicated additively in the above calculations, they are
processed as separate messages and as if they were received
in the above specified order. For instance, CL+do update+CR
is equivalent to three messages, namely CL, do update and
CR, received in this order. The update is performed at the
LC-site upon receipt of all the LA+RA messages and it is
performed at each of the remaining sites upon receipt of the
CL+do update+CR message.


Let us now calculate the average delay, R, involved in
the Logical Component Recovery mechanism. As was discussed
earlier, the crash recovery is divided into two phases: the
'notification phase' and the 'LOCK Table update phase'.

If we let Tout = a*T, where a is a constant (a > 2), we
have

R = [a*(k-1) + n + 1]*T + 3*TMAX                    (3.7)

Since 1 <= k < n we can find the following lower and
upper bounds for R.

R >= (n + 1)*T + 3*TMAX

and

R <= [a*(n-2) + n + 1]*T + 3*TMAX                   (3.8)

-  Cost Analysis for CLC Protocol


We will neglect the processing cost here. Let M be the
average communication cost per message. Let us assume here
that the site at which the LC resides is one of the sites
participating in the transaction. If this is not the case,
one must add one extra message per broadcast message in the
protocol.


The number of messages required to have a lock granted
under no crash conditions, Nl, is

Nl =  1      + (AP -> LC : LR)
      (m-1)  + (LC -> LLCi : AL)
      (m-1)  + (LLCi -> LC : LA)
      1      + (LC -> AP : LG)
      (m-1)    (LC -> LLCi : CL)

   =  3*m - 1                                       (3.9)


Also, the number Nr of messages necessary to release
a lock is:

Nr =  1      + (AP -> LC : RL)
      (m-1)  + (LC -> LLCi : AR)
      (m-1)  + (LLCi -> LC : RA)
      1      + (LC -> AP : RG)
      (m-1)    (LC -> LLCi : CR)

   =  3*m - 1                                       (3.10)

Using our previously defined update model, the number
of messages exchanged in order to perform an update, Nupdt,
is

Nupdt =  1    + (AP -> LC : LR+update request+RL)
         m-1  + (LC -> LLCi : AL+AR)
         m-1  + (LLCi -> LC : LA+RA)
         m-1  + (LC -> LLCi : CL+do update+CR)
         1      (LC -> AP : update done)

      =  3*m - 1

Therefore the average communication cost per update,
Cupdt, is

Cupdt = (3*m - 1)*M                                 (3.11.2)

where again m is the number of sites participating in the
transaction and not the total number of sites in the
network.

Let us now calculate the recovery cost, Crec. The
number of messages, N1, exchanged during the recovery phase
given that k AN messages were sent by the nominator can be
calculated as follows.

N1 =  k      + (k AN messages)
      n      + (cycle of NA message)
      (n-2)  + (source ACK messages)
      (n-1)  + (newLC -> LLCi : UT)
      (n-1)  + (LLCi -> newLC : UT)
      (n-1)    (newLC -> LLCi : RNA)

   =  k + 5*n - 5                                   (3.12)

As 1 <= k < n, we can find the following lower and upper
bounds for Crec.

Crec >= (5*n - 4)*M

and                                                 (3.13)

Crec <= (6*n - 6)*M
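
The message counts above can be verified by adding up the per-step terms. The sketch below does so and checks the closed forms, taking Crec = N1*M; the function names are hypothetical and the code is only a check of the arithmetic, not part of the protocol.

```python
def lock_messages(m):
    """Nl: LR, (m-1) AL, (m-1) LA, LG, (m-1) CL."""
    return 1 + (m - 1) + (m - 1) + 1 + (m - 1)


def release_messages(m):
    """Nr: RL, (m-1) AR, (m-1) RA, RG, (m-1) CR."""
    return 1 + (m - 1) + (m - 1) + 1 + (m - 1)


def update_messages(m):
    """Nupdt: combined request, three rounds of (m-1) messages, update done."""
    return 1 + 3 * (m - 1) + 1


def recovery_messages(n, k):
    """N1 for n sites when k AN messages were sent, with 1 <= k < n."""
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    return k + n + (n - 2) + 3 * (n - 1)


def recovery_cost_bounds(n, M):
    """Lower and upper bounds on Crec = N1 * M."""
    return (5 * n - 4) * M, (6 * n - 6) * M
```

Since k is an integer with k < n, the largest value is k = n-1, at which the upper bound is reached exactly.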

-  Extension 

It has been observed in most of the existing distributed
systems that a large percentage of the generated
transactions is local, in the sense that the resources
needed to satisfy a given transaction are either located at
the site of origin of the transaction or in neighboring
sites. This observation suggests that significant savings
in terms of communications cost and delay can be achieved if
one optimizes the operation of the algorithm to adapt to
such a highly skewed distribution of activity. To
illustrate the point, consider a set of interconnected
computer networks. We believe that in such a case, most of
the operations will be confined to one computer network
while relatively few operations will cross network
boundaries.


This  section  outlines  an  extension  to  the  CLC  protocol 
that  permits  the  forms  of  performance  optimization  needed 
for  the  cases  discussed  above.  The  extension,  which  we  call 
an  HCLC  (for  Hierarchical  CLC)  protocol,  consists  of  a 
hierarchical  organization  of  resource  controllers.  A  tree 
of  controllers  is  provided  where  the  root  is  considered  to 
be  at  level  0  and  all  the  children  of  a  controller  at  level 
i  are  at  level  i+1  in  the  hierarchy. 


Each  controller  (except  for  the  leaves)  serves  as  an  LC 
for  its  children.  Also,  each  controller  (except  for  the 
root  of  the  hierarchy)  acts  as  an  LLC  for  its  parent. 
Therefore,  each  controller  has  to  maintain  two  distinct  LOCK 
tables,  which  we  call  parent-LT  and  child-LT.  The  parent-LT 
for  the  root  controller  contains  one  lock  for  the  whole  DB 
in  exclusive  mode.  The  child-LT  for  a  leaf  is  empty. 


The normal operation of the HCLC protocol is most easily
understood through an example. Figure 3.9 shows a
three-level hierarchy. Application programs interact with
lock controllers K1 and K2 at the level above the leaves
(since the leaves are LLCs). This interaction is the same as
the AP-LC interaction in the CLC protocol. In fact,
application programs are not aware that the controllers are
hierarchically organized. Let a lock request, x, from AP1 be
submitted to K1. If x conflicts with any other lock in
child-LT(K1), then the lock request is treated in the same
way as in the CLC protocol. If there is no conflict, K1's
parent-LT is searched for a lock y which covers x. A lock x1
is said to cover a lock x2 if the portion of the DB
specified by x2 is contained in the portion of the DB
addressed by x1 and if the lock mode specified by x1 is not
weaker than the lock mode in x2. The existence of a lock
such as y in parent-LT(K1) indicates that K1 currently has
control over the resources requested by AP1. If y is found,
the lock request x can be granted, and to this end K1
interacts with K3 and K4 in the same way as an LC interacts
with the LLCs in its component. On the other hand, if y
cannot be found, the lock request x is submitted by K1 to
K0. K0 will act with respect to K1 and K2 in the same way
that K1 did with respect to K3 and K4. The difference in
this case is that, since K0 is the root, there is a lock in
parent-LT(K0) for the whole DB in exclusive mode. This lock
covers any other lock.
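
The cover relation and the upward search for a covering lock
can be sketched as follows. This is only an illustration
under stated assumptions: a lock is modeled as a set of data
items plus a mode, only two modes (shared, exclusive) are
assumed, and the names Lock, Controller, covers and
first_covering_controller are hypothetical. The conflict
check against the child-LT and the actual granting of the
lock are not modeled.

```python
from dataclasses import dataclass, field

# Lock modes ordered by strength (assumption: only these two modes exist).
STRENGTH = {"shared": 0, "exclusive": 1}


@dataclass(frozen=True)
class Lock:
    items: frozenset  # portion of the DB addressed by the lock
    mode: str         # "shared" or "exclusive"


def covers(x1, x2):
    """x1 covers x2: x2's portion is contained in x1's, and x1's mode is not weaker."""
    return x2.items <= x1.items and STRENGTH[x1.mode] >= STRENGTH[x2.mode]


@dataclass
class Controller:
    name: str
    parent: "Controller | None" = None
    parent_lt: list = field(default_factory=list)  # locks held from the parent


def first_covering_controller(k, x):
    """Walk up from controller k until some parent-LT holds a lock covering x."""
    while k is not None:
        if any(covers(y, x) for y in k.parent_lt):
            return k.name
        k = k.parent
    return None


# The root holds one exclusive lock on the whole DB, so the search always ends.
DB = frozenset({"a", "b", "c", "d"})
k0 = Controller("K0", parent_lt=[Lock(DB, "exclusive")])
k1 = Controller("K1", parent=k0, parent_lt=[Lock(frozenset({"a", "b"}), "shared")])
```

A request covered by K1's parent-LT is resolved at K1; any
other request travels toward the root, where the exclusive
lock on the whole DB covers it.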


In an HCLC protocol, locks may be released either explicitly
or automatically. Locks in child-LT(Ki), for i=1,2, are
released explicitly upon request from APs using the same
mechanism described in the CLC protocol. Locks in
parent-LT(Ki), for i=1,2, can be released automatically as
soon as there are no locks in the corresponding child-LTs
which depend upon them. To this end, each lock y in
parent-LT(K), for any controller K, has associated with it a
list of locks in child-LT(K) covered by y. Also, each lock x
in child-LT(K) points to the lock y in parent-LT(K) which
covers x. When a lock x is explicitly released from
child-LT(K1), the lock list for its corresponding lock, y,
in parent-LT(K1) is updated accordingly. Whenever this list
becomes empty, a release request may be automatically
generated by K1 and submitted to K0. In general, the
automatic release of locks can be propagated up to the root.
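
The bookkeeping behind automatic release can be sketched as
follows. The class name LockTables and its methods are
hypothetical, locks are represented by plain strings, and
the message that would carry the release request to the
parent controller is not modeled; the sketch only shows when
such a request becomes possible.

```python
class LockTables:
    """Dependency lists linking the child-LT and parent-LT of one controller."""

    def __init__(self):
        self.dependents = {}  # parent lock y -> set of child locks covered by y
        self.covered_by = {}  # child lock x  -> the parent lock y covering x

    def grant(self, x, y):
        """Record that child lock x was granted under parent lock y."""
        self.dependents.setdefault(y, set()).add(x)
        self.covered_by[x] = y

    def release(self, x):
        """Explicitly release child lock x.

        Returns the parent lock y if its list became empty (so a release
        request for y may be sent to the parent controller), else None.
        """
        y = self.covered_by.pop(x)
        self.dependents[y].discard(x)
        if not self.dependents[y]:
            del self.dependents[y]
            return y
        return None
```

Applying the same step at the parent controller when it
receives the release request gives the propagation toward
the root. A policy that delays releases would simply hold
the returned lock instead of forwarding it at once.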


This hierarchical protocol can be easily adjusted by policy
decisions both to delay such releases and to establish early
locks at higher levels in anticipation of local lock
requests. Lock management policies analogous to LRU-like
memory management policies are obvious candidates.


For the set of interconnected computer networks, a
three-level hierarchy could be constructed as follows.
There is one LC per computer network, all of them at level
1. Their children, at level 2, are their corresponding
LLCs. Finally, the root is any site acting as a global
controller for the entire collection of computer networks.


An interesting property of the proposed extension is that
there is always one controller which is able to detect the
existence of a cycle in the lock-request graph. This
controller is the common ancestor, with the largest level
number, of all the controllers where the requests in the
cycle originated. In the example of figure 3.9, the common
ancestor of K1 and K2 is K0.
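
As an illustration, the common ancestor with the largest
level number can be computed from a parent map. The
functions path_to_root and lca are hypothetical names, and
the parent map below encodes only the part of figure 3.9
that the text describes (K0 as root, K1 and K2 below it, K3
and K4 below K1).

```python
def path_to_root(parent, k):
    """List of controllers from k up to the root, k first."""
    path = [k]
    while parent.get(k) is not None:
        k = parent[k]
        path.append(k)
    return path


def lca(parent, controllers):
    """Common ancestor of the given controllers with the largest level number."""
    # Root-first paths share a common prefix; its last element is the answer.
    paths = [list(reversed(path_to_root(parent, k))) for k in controllers]
    ancestor = None
    for nodes in zip(*paths):
        if len(set(nodes)) != 1:
            break
        ancestor = nodes[0]
    return ancestor


parent = {"K0": None, "K1": "K0", "K2": "K0", "K3": "K1", "K4": "K1"}
```

In this sketch a controller counts as its own ancestor, so
the result for a single controller is that controller.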


Crash  recovery  algorithms  for  the  HCLC  protocol  must 
include  mechanisms  to  reconstruct  the  hierarchy,  in  addition 
to  the  recovery  mechanisms  present  in  the  CLC  protocol. 


1-2  - 


This  section  is  concerned  with  issues  of  locking  and 
deadlock  detection  mechanisms  in  distributed  databases.  The 
problem of system deadlocks in multiprogramming and
multiprocessing systems has received considerable attention in the
literature  and  is  well  understood  [HABE  69,  SHOS  70, 
COFF  71].  There  are  three  approaches  to  the  treatment  of 
deadlocks:  deadlock  prevention,  deadlock  avoidance  and 
deadlock  detection  and  resolution. 


Deadlock prevention requires that all the resources be
acquired at once by a transaction. This requirement cannot
always be satisfied in a database environment since the
resource needs of a transaction may be data dependent and
not precisely known at the start of the transaction.
Therefore it would be necessary for a transaction to acquire
all possible resources required, thereby decreasing system
concurrency.

Deadlock avoidance requires some advance knowledge of the
resource usage of transactions in order to determine at each
point in time whether there is a valid sequence of actions
of the already initiated but not yet completed transactions
such that all of them can be run to completion. Again, this
approach is not practical in distributed databases since the
necessary advance information to avoid deadlocks is either
absent or is distributed enough to render inefficient any
attempt to avoid deadlocks.

Deadlock detection can be done by searching for cycles in a
"state graph" [COFF 71]. A method for the detection and
resolution of deadlocks in a centralized database system was
presented in [KING 73]. In distributed databases, however,
it is not efficient to maintain a global state graph for the
whole system. Two methods for detecting deadlocks in
distributed databases, which do not require that a global
graph be built and maintained, are presented in this
section, one hierarchical and one distributed. An outline of
the proof that both protocols will detect all existing
deadlocks is given here. For the case of the hierarchical
protocol, the problem of establishing the hierarchy in a way
that minimizes the cost of using the protocol is introduced.
1-1.1  -  Formal  Model  Transaction  Processing 


This section introduces the necessary notation and formalism
upon which we base the locking protocols and deadlock
detection mechanisms presented in this chapter. The database
is considered to be distributed among n sites, S1, S2, ...,
Sn, of a computer network. Users interact with the database
via transactions. A transaction is a sequence of actions
which can be either read, write, lock or unlock operations.
Transactions are assumed to be two-phase, i.e. once an
unlock operation has been issued no other lock can be
requested by the transaction. As shown in [ESWA 76], this is
necessary to preserve database consistency. If the actions
of a transaction involve data at a single site the
transaction is called local, as opposed to a distributed
transaction, which involves resources at several sites. We
assume that distributed transactions are implemented as a
collection of processes which act on behalf of the
transaction. Those processes are called transaction
incarnations. There may be one or more incarnations of the
same transaction at each participating site. A transaction
incarnation is responsible, among other things, for:


a.  acquiring, using and releasing resources local to
the site at which it is executing as needed by
the transaction.


b.  exchanging messages with remote incarnations of
the same transaction for purposes of cooperating
with incarnations located at foreign sites.


Note  that  this  model  of  a  transaction  execution  is  gen¬ 
eral  enough  to  accommodate  other  models. 


A transaction can be in two different states, namely active
and blocked. A transaction is blocked if its execution
cannot proceed because a needed resource is being held by
another transaction, and the transaction is active
otherwise. We now introduce a graphic model which depicts
the state of execution of all transactions in the system.
This model is in the form of a graph called the
transaction_wait_for graph or TWF graph. The nodes of this
graph are associated with transaction incarnations and are
labeled by the pair (transaction_name, site_name). Note that
labeling transaction incarnations with the pair
(transaction_name, site_name) provides unique global names
for these nodes if there is at most one transaction
incarnation per site per transaction. If more than one
incarnation is to be allowed then distinct local names
should be assigned to them. Since this distinction is
irrelevant for the forthcoming discussion we will assume
that there is only one incarnation of any transaction per
site.


The arcs of the TWF graph are defined as follows:


a.  there is a directed arc from node (Ti,S) to node
(Tj,S) if the incarnation of transaction Ti at
site S is blocked and waiting for the incarnation
of transaction Tj to release a resource needed by
Ti. (Ti,S) is said to be in "resource wait" for
(Tj,S).


b.  there is a directed arc from node (T,Si) to node
(T,Sj) if the incarnation of transaction T at site
Si is blocked and waiting for a message from the
incarnation of T at site Sj. (T,Si) is said to be
in "message wait" for (T,Sj).


It can be easily seen that the existence of a cycle in the
transaction_wait_for graph is a necessary and sufficient
condition for a deadlock to occur.
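
Checking this condition amounts to a standard depth-first
search for a cycle in a directed graph. The sketch below
assumes nodes are (transaction, site) pairs and arcs are
given as a list of node pairs; find_cycle and the sample
graph are illustrative and do not reproduce any figure of
this report.

```python
def find_cycle(arcs):
    """Return one cycle of the directed graph as a list of nodes, or None."""
    graph = {}
    for src, dst in arcs:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {n: WHITE for n in graph}
    stack = []

    def visit(n):
        color[n] = GREY
        stack.append(n)
        for m in graph[n]:
            if color[m] == GREY:  # back arc: m is on the current path
                return stack[stack.index(m):]
            if color[m] == WHITE:
                cycle = visit(m)
                if cycle:
                    return cycle
        stack.pop()
        color[n] = BLACK
        return None

    for n in graph:
        if color[n] == WHITE:
            cycle = visit(n)
            if cycle:
                return cycle
    return None


# Hypothetical global deadlock spanning two sites.
twf = [
    (("T1", "S1"), ("T2", "S1")),  # resource wait at S1
    (("T2", "S1"), ("T2", "S2")),  # message wait, S1 to S2
    (("T2", "S2"), ("T1", "S2")),  # resource wait at S2
    (("T1", "S2"), ("T1", "S1")),  # message wait, S2 to S1
]
```

Removing any one of the four arcs breaks the cycle, which is
what a resolution action (for instance preempting one
transaction) achieves.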


Figure 3.10 shows an example of a transaction_wait_for graph
for a network with two sites S1 and S2. This graph shows two
deadlock cycles. One of them is a local deadlock since it
involves only incarnations of transactions at site S2 and
therefore only resources local to S2. The other deadlock
cycle spans both sites and is an example of what we call a
global deadlock.

Figure 3.10 - A Transaction_Wait_For Graph
for a network with two sites S1 and S2.

While the centralized method may be practical and efficient
for local networks, it may impose fairly large
communications cost in geographically distributed systems.
This observation stems from the fact that the central
deadlock detector may be located very "far" from some of the
sites in the network. As an example, we certainly do not
want deadlocks which only involve resources located at some
host computers located in Southern California to be detected
at the East Coast. The hierarchical approach to deadlock
detection presented in the next section lends itself to an
optimization by which deadlocks can be detected by a site
which is located as "close" as possible to the sites
involved in the cycle.

2.1.1 - A Hierarchically Organized Locking and Deadlock
Detection Protocol

Let the database, DB, be partitioned into a set of
subdatabases DBi's such that DB is the union of all the
DBi's and DBi and DBj are disjoint for i ≠ j. The locking
and deadlock detection mechanism presented here has as its
core a hierarchy of lock controllers which interact in a way
to be explained in this section. First, we are going to
distinguish between the controllers which are at the
bottommost level of the hierarchy, called leaf controllers
or LKs, and the non-leaf controllers or NLKs.


A leaf controller, LKi, is assigned to each subdatabase DBi.
In the example shown in figure 3.11 we have three leaf
controllers LK1, LK2 and LK3 and two non-leaf controllers
NLK0 and NLK1. Each leaf controller, LKi, maintains a
transaction_wait_for graph, TWF(LKi). This graph contains
all the nodes of the global TWF associated with transaction
incarnations local to LKi. In addition, two special types of
nodes called output port nodes and input port nodes are
introduced in the TWF of a leaf controller. These nodes are
associated with the arcs of the global TWF which join
incarnations in two distinct controllers and are defined as
follows.


a.  a node in the TWF graph of LKi is called an output
port and denoted O(LKi,T) if the global TWF
contains an outgoing arc from an incarnation of
transaction T local to LKi into a non-local
incarnation of T.


b.  a node in the transaction_wait_for graph of LKi is
called an input port and denoted I(LKi,T) in
TWF(LKi) if in the global TWF there is an incoming
arc into an incarnation of transaction T local to
LKi from a non-local incarnation of T.


Note that labels assigned to input and output ports are
unique since there is only one transaction incarnation per
transaction per site. In the example of figure 3.11, the
output port O(LK1,T1) and the input port I(LK2,T1)
correspond to an arc in the global TWF from T1@LK1 (the
incarnation of T1 at LK1) to T1@LK2. The dashed lines
indicate arcs in the global TWF. These arcs are represented
explicitly at the upper levels of the hierarchy as will be
explained below.
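
The two definitions can be illustrated by deriving the ports
from a list of global TWF arcs. In this sketch nodes are
(transaction, leaf controller) pairs, and the function name
ports is hypothetical. It relies on the property of the
model that an arc joining two distinct controllers always
joins two incarnations of the same transaction.

```python
def ports(global_twf):
    """Return (output_ports, input_ports) as sets of (controller, transaction)."""
    outputs, inputs = set(), set()
    for (t_src, lk_src), (t_dst, lk_dst) in global_twf:
        if lk_src != lk_dst:             # arc joins two distinct controllers
            outputs.add((lk_src, t_src))  # O(lk_src, T)
            inputs.add((lk_dst, t_dst))   # I(lk_dst, T)
    return outputs, inputs


# The arc from T1@LK1 to T1@LK2 mentioned in the text, plus one local
# resource-wait arc (the second arc is a made-up example).
arcs = [
    (("T1", "LK1"), ("T1", "LK2")),
    (("T1", "LK2"), ("T3", "LK2")),
]
```

For these arcs the only output port is O(LK1,T1) and the
only input port is I(LK2,T1); the local arc inside LK2
creates no port.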


Non-leaf controllers maintain a graph called the
input-output-ports (IOP) graph. Nodes of an IOP are
associated with input and output ports of leaf controllers.
We will refer to them as i-nodes and o-nodes respectively.
Some of the i-nodes may be themselves input ports for the
IOP and some of the o-nodes may be output ports for the IOP.
The IOP for controller NLKi, denoted IOP(NLKi), is defined
by the following rules:


R1  -  Arcs  from  i-nodes  can  go  only  to  o-nodes  and  vice- 
versa. 


R2 - There is an arc from o-node Oa to i-node Ib if Oa is
an output port of a leaf controller in the subtree rooted
at NLKi and Ib is a corresponding input port of another
leaf controller in the same subtree. In the example of
figure 3.11, there is an arc from O(LK1,T1) to I(LK2,T1)
in the IOP of NLK1 since O(LK1,T1) is an output port of
LK1 and I(LK2,T1) is its corresponding input port in
LK2. LK1 and LK2 are in the subtree rooted at NLK1.


R3 - There is an arc from the i-node Ia to o-node Ob in
IOP(NLKi) if there is a path from an input port Ia to an
output port Ob of a son of NLKi. In the example of figure
3.11 there is an arc from I(LK1,T4) to O(LK2,T9) in
NLK0 since in NLK1 there is a path between I(LK1,T4) and
O(LK2,T9).

R4 - An input (output) port of IOP(NLKi) is also an input
(output) port of a leaf controller in the subtree rooted
at NLKi. In the example of figure 3.11 the input port of
IOP(NLK1) is also an input port of LK1.

1-1- 1-1  -  Hierarchical  Protocol  -  A  Description

Before we describe the protocol let us define the lowest
common ancestor between controllers K1, K2, ..., Kn,
denoted lca(K1,K2,...,Kn), as the common ancestor between
them at the lowest level in the hierarchy (the root is at
the highest level). Rules R1 through R3 below describe the
hierarchical protocol.

RULE 1: (transaction incarnation T requests a local
resource): The requested resource R is in the same
subdatabase as the transaction incarnation T. Let LKi be the
controller for resource R.


R1.1: [the resource cannot be granted] Let {T1, T2,
...,Tk} be the set of transactions which currently
hold resource R. Add an arc from (T,LKi) to (Tj,LKi)
for j = 1 to k. Check the transaction_wait_for graph at
LKi for the existence of cycles.


R1.1.1: If cycles were formed then one or more local
deadlocks have been detected and an appropriate
action is required for deadlock resolution.


R1.1.2: The addition of the arcs mentioned in Rule
R1.1 may have created one or more paths between
input and output ports of LKi. For each such path
send the (input port, output port) pair which
delimits the path to the father of LKi.


RULE 2: (transaction incarnation T requests a non-local
resource): The requested resource R is in a different
subdatabase from the previously requested resource.
Therefore it has to be acquired by an incarnation of T local
to R. Let LKi be the controller for the previously requested
resource and let LKj be the controller for resource R.
The incarnation of T at LKi becomes blocked, waiting
for a message from the incarnation of T at LKj. The node
(T,LKi) is now an output port of the transaction_wait_for
graph at LKi and the node (T,LKj) is an input port of the
transaction_wait_for graph at LKj.

R2.1: An arc from O(LKi,T) to I(LKj,T) is created in
the IOP graph of the lowest common ancestor between
LKi and LKj.

R2.2: An o-node labeled O(LKi,T) is added to the IOP
graph of each controller in the path between LKi and
lca(LKi,LKj). Each such o-node is also an output
port of the corresponding IOP graph.

R2.3: An i-node labeled I(LKj,T) is added to the IOP
graph of each controller in the path between LKj and
lca(LKi,LKj). Each such i-node is also an input port
of the corresponding IOP graph.

The protocol followed by a non-leaf controller, NLKi,
is described by rule 3 below.

RULE 3: an arc is added to IOP(NLKi).

R3.1: If a cycle is generated by the addition of the
new arc then a global deadlock has been detected and
an appropriate action is required to resolve it.


R3.2: If no cycle was generated, check whether any
input-output port connection has been generated in
IOP(NLKi) and report the endpoints of any such
connections to the father of NLKi.
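
The graph operations behind rules R1.1.2 and R3 can be
sketched as follows. Graphs are dictionaries mapping a node
to the set of its successors, and the function names and
string labels are hypothetical. Message passing between
controllers and deadlock resolution are not modeled, and
only paths containing at least one arc are considered.

```python
def reachable(graph, start):
    """Nodes reachable from start by a path of one or more arcs."""
    seen, todo = set(), [start]
    while todo:
        n = todo.pop()
        for m in graph.get(n, ()):
            if m not in seen:
                seen.add(m)
                todo.append(m)
    return seen


def io_pairs(graph, input_ports, output_ports):
    """(input port, output port) pairs joined by a path; reported to the father."""
    return {(i, o) for i in input_ports for o in reachable(graph, i) & output_ports}


def add_arc_and_check(iop, src, dst):
    """Add arc src -> dst to an IOP graph; True means the arc closed a cycle."""
    iop.setdefault(src, set()).add(dst)
    return src in reachable(iop, dst)


# Leaf LK1: the input port T4 waits for T1, which is an output port.
leaf_twf = {"T4": {"T1"}}
```

When add_arc_and_check returns True the controller has found
a global deadlock (R3.1); otherwise io_pairs applied to the
IOP graph yields the connections to report upward (R3.2).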

We have not yet mentioned how lock releases are reflected in
the appropriate graphs in the hierarchy. Let each controller
(LK or NLK) maintain a list of the i-o paths (i.e. the paths
which connect input ports to output ports) in its graph. A
possible representation for this list could be in the form
of a bit matrix where each row corresponds to an i-o path
and each column is associated with an arc in the graph. The
value '1' in the i,j entry of this matrix indicates that arc
j is in the i-o path i. An unlock operation causes an arc
(maybe more) to be deleted from a TWF graph of a leaf
controller. All the i-o paths (if any) which contained this
arc are broken. This may be reported to the father of the
LK. There, the arcs which represented the broken i-o paths
will be used to find which i-o paths were broken in the
non-leaf controller. This propagation continues up in the
hierarchy until the deletion of an arc from a graph does not
cause any i-o path to be broken.
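
A minimal sketch of the bit-matrix bookkeeping follows. The
function delete_arc and the path names are hypothetical; the
matrix is a list of rows (one per i-o path) whose columns
correspond to arcs, as described above.

```python
def delete_arc(matrix, path_names, j):
    """Delete arc j from the graph.

    Returns the names of the i-o paths that contained arc j (the broken
    paths, to be reported to the father) and removes their rows.
    """
    broken = [name for name, bits in zip(path_names, matrix) if bits[j]]
    keep = [(name, bits) for name, bits in zip(path_names, matrix) if not bits[j]]
    path_names[:] = [name for name, _ in keep]
    matrix[:] = [bits for _, bits in keep]
    return broken


# Three i-o paths over three arcs; entry [i][j] == 1 means arc j is in path i.
matrix = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
names = ["p0", "p1", "p2"]
```

An empty result means that no i-o path was broken, which is
the condition under which the upward propagation stops.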

The method described above for deadlock detection requires
that non-leaf controllers be kept up to date continuously.
Other variations can be used when appropriate. For

of the non-leaf controller which is the lowest
common ancestor between the LKi's.

Property a follows directly from Rule 1 of the protocol.
The validity of property b is not so straightforward
and will be shown to hold by the following theorems.

Let us introduce some notation first. Let the arcs which
connect an i-node to an o-node in an input-output-ports
graph be called i->o arcs and let o->i arcs be the ones
which connect o-nodes to i-nodes. Let tree(K) be the
subtree of the hierarchy rooted at controller K. Let us now
extend the notation TWF(K) to indicate the subgraph of the
global TWF obtained by considering only the resources
controlled by all the LKs in tree(K).

Let us show that if a cycle is detected in the IOP graph of
a non-leaf controller, there is a deadlock.

THEOREM 3.9: If there is a cycle in the input-output-ports
graph of a non-leaf controller NLKi then there is a
deadlock.

Proof: We need to prove that if there is a cycle C in the
IOP graph of a non-leaf controller NLKi there is an
associated cycle C' in TWF(NLKi). Let us show how to
construct the cycle C'. Let C be a cycle in IOP(NLKi). For
the purpose of this construction let us label each i->o arc
(Ij,Oj) in C with the label Kj if Kj is an immediate son of
NLKi such that the creation of an input-output port
connection in Kj caused the arc (Ij,Oj) to be created in
IOP(NLKi) (see rules R1.1.2 and R3.2 of the protocol). Arcs
of the type o->i in IOP(NLKi) will not be labeled. Using the
notation (a,b,c) to indicate an arc labeled c from node a to
node b, let the cycle C be C = (I1,O1,K1), (O1,I2,-),
(I2,O2,K2), ..., (In,On,Kn), (On,I1,-).

The repeated application of operation 1 below will
transform the cycle C into a cycle in which all the labels
indicate leaf controllers.

Operation 1: If Kj in (Ij,Oj,Kj) is a non-leaf
controller, replace (Ij,Oj,Kj) by a path connecting Ij to
Oj in IOP(Kj).

After this transformation, there is a path in the TWF of an
LK in tree(NLKi) associated with each i->o arc in C. There
is also an arc between incarnations of transactions in the
TWFs of distinct LKs associated with each o->i arc in C.
More precisely, for each i->o arc, (Ii,Oi,Ki), in C there
is a path in TWF(Ki) between the input port Ii and the
output port Oi of TWF(Ki). Now, by rule R2.1 of the protocol
each o->i arc, (Oi,Ii+1,-), in C connects two incarnations
of the same transaction. These incarnations are local to
leaf controllers in tree(NLKi). Finally, the sum of all
the paths and arcs thus obtained defines the cycle C' in
TWF(NLKi).

In order to conclude the proof we must show that the arcs in
the cycle C' defined above exist simultaneously and
therefore C' is a deadlock cycle. Assume not. Then there is
at least a pair of arcs Ti -> Tj and Tm -> Tn which did not
appear simultaneously in C' and such that there is a path
from Tj to Tm in TWF(NLKi). Consider first the case in which
Ti -> Tj existed (i.e. appeared and disappeared) before
Tm -> Tn. Then, transaction Tj released the resource it was
holding (which was needed by Ti). This implies that all the
transactions in the path from Tj to Tm in TWF(NLKi) must
have released at least one resource also. Since transactions
are assumed to be two-phase, none of them (including Tm) can
issue any further lock requests. Therefore the arc Tm -> Tn
cannot exist. This contradicts our assumption and shows that
it is not possible for Ti -> Tj to have existed before
Tm -> Tn. Consider now the case in which Tm -> Tn existed
before Ti -> Tj. This case is perfectly possible so long as
these two arcs are not part of the same cycle; otherwise one
would have to conclude that every arc in the cycle existed
before itself. Therefore, all the arcs in C' coexist in
TWF(NLKi), C' represents a deadlock cycle, and the theorem
is proved.

We now want to show that, given a cycle in the global TWF,
there is a corresponding cycle in one of the controllers in
the hierarchy. In order to state this property more
precisely some definitions are in order. Let the graph G' be
called an arc-condensation or simply condensation of the
graph G if the node set of G' is a subset of the node set of
G and if (v1,vn) is an arc in G' then there is a path from
node v1 to node vn in the graph G.
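
The definition of a condensation can be checked mechanically.
In the sketch below graphs are dictionaries mapping each node
to the set of its successors; is_condensation and
reachable_from are hypothetical names, and a path is taken to
contain at least one arc.

```python
def reachable_from(graph, start):
    """Nodes reachable from start by a path of one or more arcs."""
    seen, todo = set(), [start]
    while todo:
        n = todo.pop()
        for m in graph.get(n, ()):
            if m not in seen:
                seen.add(m)
                todo.append(m)
    return seen


def node_set(graph):
    """All nodes that appear as a source or a target of an arc."""
    return set(graph) | {m for succ in graph.values() for m in succ}


def is_condensation(g_prime, g):
    """True if g_prime is an arc-condensation of g."""
    if not node_set(g_prime) <= node_set(g):
        return False
    return all(
        v in reachable_from(g, u)
        for u, succ in g_prime.items()
        for v in succ
    )


# A three-node cycle and a two-node condensation of it.
g = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
g_condensed = {"a": {"c"}, "c": {"a"}}
```

Here the arc (a,c) stands for the path a, b, c in g, so the
two-node cycle is a condensation of the three-node one.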

THEOREM  3.10:  Given  a  cycle  C  in  the  global  TWF  there  is  an 
arc  condensation  of  C  in  one  and  only  one  controller  in  the 
hierarchy.  This  controller  is  the  lowest  common  ancestor  of 
all  the  controllers  which  manage  the  resources  in  the  cycle 
C. 

Proof: Let C be a cycle in the global TWF which involves
resources controlled by leaf controllers LK1, LK2, ..., LKn.
Each of the n leaf controllers knows only the portion of the
cycle which contains resources local to the controller. Let
us consider an initial condensation of the cycle C obtained
by substituting every path in the TWF from an input port to
an output port of a leaf controller by a single arc
connecting these ports. For the purpose of this proof let us
introduce a representation for a cycle which clearly
illustrates how the knowledge about portions of the cycle is
distributed throughout several controllers in the hierarchy.
So, let a cycle be represented by a sequence of labeled arcs
(I0,O0,K0), (I1,O1,K1), ..., (Ij,Oj,Kj), ...,
(In-1,On-1,Kn-1) where (Ij,Oj,Kj) indicates that there is a

IOP(K) since every time a connection between an
input and an output port is created it is
propagated to its father. See rules R1.1.2 and R3.2 of
the protocol.

c.  Since the controller K0 is not a son of controller
K the input port II is also an input port of
K which is in the path between K0 and lca(K1,K0).
See rule R2.3 of the protocol. The same observation
applies to controller Kl+1 and the output
port 01 (see rule R2.2 of the protocol).


Therefore when all the labels are the same we have a
representation of a condensed version of the original cycle.
The common label is the name of the controller where the
cycle is completely represented. Since operations 1 and 2
preserve the lowest common ancestor between the controllers
involved in the cycle, the original cycle will be
represented in a condensed way in lca(LK1, LK2, ..., LKn).

Theorems 3.9 and 3.10 together give us a necessary and
sufficient condition for a deadlock to be detected in a
distributed database involving resources at different sites.
This result will be stated as the following theorem.

THEOREM 3.11: A necessary and sufficient condition for a
deadlock involving resources in controllers LK1, LK2, ...,
LKn to exist is that there is a cycle in the IOP graph of
lca(LK1, LK2, ..., LKn). This cycle is a condensation of
the corresponding cycle in the global TWF.

l-l-l-l  -  Deadlock  Resolution

As pointed out in an earlier section, we are not going to
examine in this dissertation the criteria involved in
optimal deadlock resolution since this is mainly a policy
issue. This section discusses, however, the mechanisms which
are necessary to allow the implementation of any such
policy. The reader may have noticed that the condensed cycle
in the IOP graph contains less information than the
corresponding cycle in the TWF graph. In particular, not all
the transactions which participate in the cycle in the TWF
graph appear in the IOP graph.

One  way  to  compensate  for  this  loss  of  information  is 
to  require  that  whenever  an  i-o  arc  is  received  by  an  NLK, 
the  name  of  the  controller  which  generated  the  arc  be  stored 
with  the  arc.  Then,  when  an  NLK  detects  a  deadlock  cycle  it 
can  send  down  the  tree,  to  its  appropriate  sons,  a  message 
which will continue to propagate down (through the
appropriate sons) until it reaches the leaves of the tree.  At this
point  the  LKs  can  report  directly  to  the  NLK  which  detected 
the  deadlock  all  the  necessary  information  to  implement  the 
desired  policy  for  deadlock  resolution. 

Notice that the additional messages necessary to support
the above described mechanism do not substantially increase
the total communications cost of the protocol since they
must only be sent when deadlocks are detected and not
during normal operation.

Another,  less  flexible  alternative,  is  to  select  the 
transaction  to  be  preempted  from  those  which  appear  in  the 
IOP  graph  only.  While  no  additional  messages  are  required 
here  it  is  likely  that  a  non  optimal  choice  will  be  taken  in 
resolving  the  deadlock. 


2-2-l.JL  -  Hierarchy  Establishment

So  far  we  have  assumed  the  existence  of  the  hierarchy 
without  considering  how  it  is  established  in  the  first 
place.  The  performance  of  the  hierarchical  protocol,  in 
terms  of  the  overhead  message  traffic  incurred  by  it,  can  be 
minimized  if  the  hierarchy  is  appropriately  chosen.  This 
choice  should  consider  the  pattern  of  the  DB  traffic  with 
respect  to  the  locality  of  access  to  a  controller  or  group 
of controllers. For instance, assume that one is able to
identify groups or clusters of leaf controllers such that a
high percentage of the DB traffic involves controllers in
the same cluster while very little traffic is of the
inter-cluster type. One possibility here would be to assign
non-leaf controllers to each cluster and try to further
cluster the non-leaf controllers or put all of them together
under the same root.

The problem in general can be stated as follows. Given a set
of leaf controllers, assigned to the nodes of a computer
network, given the DB traffic pattern and given the cost of
sending messages between every pair of nodes in the
network, find a hierarchy which minimizes the total cost
incurred in using the protocol.

There are clearly some heuristic rules which, if applied to a
given hierarchy, result in another hierarchy of lower cost. The
general optimization problem, however, is the subject of current
research.


A locking protocol which uses distributed control and a
distributed deadlock detection mechanism is presented here.
Before we describe the protocol, some definitions are in order.
Each database site controls a set of resources. Transactions
request resources by sending their requests to the controller of
the resource. Each controller is responsible for:

1. processing lock and lock release requests for local
   resources. Requests may originate from any node in the
   network.

2. building a simplified version of the transaction_wait_for
   graph and detecting deadlocks. The TWF maintained by a
   controller is a subgraph of the global TWF.

For the purpose of stating this algorithm we will use a
simplified version of the transaction_wait_for graph. The reader
should be aware that this version is a redefinition of the graph
used in the section on the hierarchical protocol, although the
same name for the graph is retained. In this version there is no
notion of transaction incarnations.

Nodes are associated with transactions and there is a directed
arc from transaction T' to transaction T'' if T' is blocked and
must wait for T'' to release a resource (not necessarily a
resource needed by T') before T' is able to proceed.


Some definitions are in order. A non-blocked transaction is a
node in the transaction_wait_for graph with no outgoing arcs,
that is, a sink node. Let us now define blocking_set(T) as the
set of all non-blocked transactions which can be reached by
following a directed path in the TWF graph starting at the node
associated with transaction T. This is the set of transactions
which are ultimately blocking transaction T. The pair (T,T') is
said to be a blocking pair of T if T' is in blocking_set(T).
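
The definition of blocking_set(T) can be illustrated with a short
sketch. The representation below (a dictionary mapping each
transaction to the set of transactions it waits for) and the
function name are assumptions made for illustration only; the
sketch also assumes that T itself is not counted as a member of its
own blocking set.

```python
def blocking_set(twf, t):
    """Return the non-blocked transactions reachable from t.

    twf maps a transaction to the set of transactions it waits for.
    A transaction with no outgoing arcs is non-blocked (a sink).
    Assumption: t itself is never reported, even if it is a sink.
    """
    seen, stack, sinks = set(), [t], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        successors = twf.get(node, set())
        if not successors and node != t:
            sinks.add(node)          # reached a sink: it ultimately blocks t
        stack.extend(successors)     # keep following directed paths
    return sinks


# Hypothetical TWF graph: T1 -> T2, T2 -> T3, T2 -> T4, T4 -> T5
twf = {"T1": {"T2"}, "T2": {"T3", "T4"}, "T3": set(), "T4": {"T5"}}
blocking_pairs = {("T1", b) for b in blocking_set(twf, "T1")}
```

Here T3 and T5 are the only sinks reachable from T1, so the
blocking pairs of T1 are (T1,T3) and (T1,T5).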


The execution of a transaction can be described as follows. A
transaction has a site of origin, which is the site where the
transaction entered the system. The transaction starts running
at this site, performing local operations until operations on
non-local data are necessary. Then, a lock request in the
appropriate mode is built and sent to the controller for the
requested resource. This controller will either accept or reject
the lock, sending the reply to the site of origin of the
transaction. If there are multiple copies of data, lock requests
have to be sent to all

Some comments are in order.

a. the arcs of the transaction_wait_for graph considered here
   may represent one of two types of relationships between
   transactions, namely a direct wait and an indirect wait.
   Transaction T1 is said to be waiting directly on T2 if the
   resource needed by T1 for its continued execution is being
   held by T2. A transaction T1 is said to be waiting indirectly
   on Tk if there is a set of transactions T2, T3, ..., Tk-1
   such that Ti is waiting directly on Ti+1 for i = 1 to k-1.

b. from the previous observation it can be seen that cycles in a
   TWF graph may be a condensation (in the sense defined in the
   section on the hierarchical protocol) of the cycle that would
   exist in the global transaction_wait_for graph for the whole
   system.


1-1. A. 2.  - 


That the protocol described in the previous section is able to
detect all deadlocks is shown in the following theorem.

THEOREM 3.12: The above described protocol detects all possible
deadlocks.

Proof:  In  order  to  show  this  result  we  will  consider  a  glo¬ 
bal  deadlock  cycle,  as  shown  in  figure  3  -  1 ^ »  and  we  will 
show that this cycle will appear in the TWF of the site of
origin of at least one of the transactions in the cycle.

There are many orderings of resource requests that can lead to
the same deadlock cycle. Each time a request is rejected, the
newly formed blocking pairs will increase the knowledge that
some controllers have about the global graph. Therefore, the
bigger the advance knowledge obtained when a request is
rejected, the more rapidly the deadlock will be detected. The
ordering of resource requests used in this proof is such that
controllers will get the minimum possible knowledge of the rest
of the graph when a request is rejected. The reader should have
no difficulty in verifying that if the theorem holds for this
ordering then it holds for any other. The chosen ordering is the
following. Initially, transactions T1 through Tk-1 are blocked
and each of the controllers at the sites of origin of these
transactions has knowledge of a single transaction ahead in the
cycle. At this point the transaction_wait_for graph at the site
of origin of each transaction is as shown below.


Sorig(T1)    :  T1   -> T2
Sorig(T2)    :  T2   -> T3
   ...
Sorig(Tk-1)  :  Tk-1 -> Tk

Now, when Tk makes a request and is blocked by T1, the blocking
pair (Tk,T2) will be sent to Sorig(T2), where a new blocking
pair (Tk,T3) is formed and sent to Sorig(T3). There the blocking
pair (Tk,T4) is formed and sent to Sorig(T4), and so on until
the blocking pair (Tk,Tk-1) reaches Sorig(Tk-1). This causes the
arc from Tk to Tk-1 to be added to the TWF graph at that site.
Since this graph already contained an arc from Tk-1 to Tk, a
cycle is formed and the deadlock is detected.
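
The chain of blocking pairs used in the proof can be replayed with
a small simulation. This is only a sketch of the worst-case
ordering described above, not the protocol itself: the function
name, the integer encoding of transactions (i stands for Ti) and the
per-site sets of arcs are assumptions made for illustration.

```python
def replay_worst_case(k):
    """Replay the proof's ordering for the cycle T1 -> ... -> Tk -> T1.

    Returns (j, hops) where Sorig(Tj) is the site that detects the
    cycle and hops lists the blocking pairs forwarded on the way.
    """
    if k < 3:
        raise ValueError("this sketch assumes a cycle of at least 3 transactions")
    # Initially Sorig(Ti) knows only the single arc Ti -> Ti+1 (i = 1..k-1).
    twf = {i: {(i, i + 1)} for i in range(1, k)}
    hops = []
    target = 2                      # Tk is blocked by T1; (Tk,T2) goes to Sorig(T2)
    while True:
        twf[target].add((k, target))     # arc Tk -> Ttarget added at that site
        hops.append((k, target))
        if (target, k) in twf[target]:   # reverse arc already known: cycle found
            return target, hops
        target += 1                      # form (Tk,Ttarget+1) and forward it


site, hops = replay_worst_case(5)
```

For k = 5 the pairs (T5,T2), (T5,T3) and (T5,T4) are forwarded in
turn and the cycle is found at Sorig(T4), which already held the
arc T4 -> T5.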

1-1 - Related Work

This section describes several other approaches, suggested in
the literature, for synchronization in distributed databases.
The list of work is not intended to be exhaustive, but rather to
be representative of the spectrum of proposed solutions.

Time Stamps (Thomas)

The scheme suggested by Robert Thomas in [THOM 76] will be
referred to hereafter as the Base Variable Time Stamps or BVTS
algorithm, for reasons which will become clear in the following
description. The BVTS scheme requires that the database be fully
replicated at every site, although it was indicated in [THOM 77]
that this condition can be relaxed. The following description
considers, for simplicity, the fully redundant case.

Let  us  first  examine  how  an  update  request  is  formed 
and  how  it  is  either  accepted  or  rejected  by  the  set  of 
DBMPs  (Data  Base  Manager  Processes  -  there  is  one  DBMP  in 
every  site).  The  sequence  of  messages  necessary  to  form  an 
update  request  is: 

- an application program (AP) requests from any DBMP a set of
  variables, called base variables (BVs), upon which the new
  values for the variables to be updated (update variables or
  UVs) are calculated.

- the DBMP replies to the AP by sending it the base variable
  values along with the timestamps associated with them. A
  timestamp for a base variable is the time it was last
  modified.

- the AP calculates the values for the update variables (which
  must be a subset of the base variables) and sends to any DBMP
  an update request composed of:

  i)  a set of BVs and their originally obtained timestamps.

  ii) a set of UVs with their new values.

When a DBMP receives an update request, the request is voted on
by the set of DBMPs until it is either globally accepted or
rejected. The voting rules can be found in [THOM 76] but they
basically amount to:

- voting OK on the request if its BV timestamps are current and
  if it does not conflict with a pending request. Two requests
  are said to conflict if the intersection of the set of UV
  variables of one with the set of BV variables of the other is
  not empty. A request is said to be pending if it has been
  voted OK but not globally accepted yet.

- voting REJECT if the BV timestamps are not current.

One REJECT vote is enough to globally reject the request. The
majority of the DBMPs must vote OK on a request in order for it
to be globally accepted. If a request is rejected it must be
resubmitted.
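
The voting rules summarized above can be sketched as follows. This
is an illustration of the summary only, not of the full rules in
[THOM 76]. All names are hypothetical. Two assumptions are made: a
BV timestamp is taken to be "current" when the voting DBMP's copy
has not been modified since that timestamp, and the case of a
current request that conflicts with a pending one, which the
summary does not cover, is left undecided (None).

```python
from dataclasses import dataclass


@dataclass
class UpdateRequest:
    base: dict      # base variable name -> timestamp obtained by the AP
    updates: dict   # update variable name -> new value


def conflicts(r1, r2):
    """True if the UVs of one request intersect the BVs of the other."""
    return bool(set(r1.updates) & set(r2.base) or
                set(r2.updates) & set(r1.base))


def vote(request, local_timestamps, pending):
    """Vote of one DBMP, given its local timestamps and pending requests."""
    # Assumption: stale means the local copy was modified after the AP read it.
    if any(local_timestamps[v] > ts for v, ts in request.base.items()):
        return "REJECT"
    if any(conflicts(request, p) for p in pending):
        return None          # not covered by the summary above
    return "OK"


def globally_accepted(votes, n):
    """One REJECT rejects globally; a majority of the n DBMPs must vote OK."""
    return "REJECT" not in votes and votes.count("OK") > n / 2
```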

In the BVTS scheme, update requests propagate in a daisy chain
fashion from one DBMP to another. An expression for the delay D
experienced by an update request from the moment the base
variables are requested until the update request is finally
accepted (probably after being resubmitted several times) can be
obtained using a simple probabilistic model. The delay was found
to be

D = T*[ 4*p + qr*(p - 1) + qa ]                          (3.14)

where,

T = average message delay introduced by the network.

p  =  average  number  of  times  that  the  update  request 
must  be  submitted  until  it  is  finally  accepted. 

qr  =  average  number  of  DBMPs  that  vote  on  a  request 
each  time  it  is  submitted  and  rejected. 

qa  =  average  number  of  DBMPs  that  vote  on  a  request 
given  that  it  is  accepted. 

Let us find D for the best case, which occurs when p = 1 (i.e.,
the request is accepted the first time it is submitted) and
qa = n/2 (since a majority consensus must be reached). In this
case we have,

Dbest = T*(4 + n/2)                                      (3.15)

The average communication cost, Cupdt, for the BVTS algorithm
is

Cupdt = 2*M*[ qa + (1 + qr)*(p - 1) + 1 ]                (3.16)

where  M  is  the  average  communication  cost  per  message  and  p, 
qr  and  qa  are  as  defined  before. 
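
Expressions (3.14) and (3.16) are easy to evaluate numerically.
The sketch below simply transcribes them; the function names and
the sample parameter values are assumptions chosen for
illustration.

```python
def bvts_delay(T, p, qr, qa):
    """Delay D of expression (3.14)."""
    return T * (4 * p + qr * (p - 1) + qa)


def bvts_cost(M, p, qr, qa):
    """Average communication cost Cupdt of expression (3.16)."""
    return 2 * M * (qa + (1 + qr) * (p - 1) + 1)


n = 10                                   # hypothetical number of DBMPs
best_delay = bvts_delay(T=1.0, p=1, qr=0, qa=n / 2)
best_cost = bvts_cost(M=1.0, p=1, qr=0, qa=n / 2)
# A request that is rejected twice before acceptance (p = 3, qr = 2):
slow_delay = bvts_delay(T=1.0, p=3, qr=2, qa=n / 2)
slow_cost = bvts_cost(M=1.0, p=3, qr=2, qa=n / 2)
```

With p = 1 the qr terms vanish, so the delay reduces to
T*(4 + n/2), in agreement with (3.15).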

Again the best case occurs for p = 1 and qa = n/2, giving us

interest is not a direct function of the size of the net-


From  (3-18)  and  (3.19)  it  can  be  seen  that  for  fully 


"phantom  tuples". 


The BVTS algorithm requires that a timestamp be stored in the
database with every modifiable data item. This requirement
causes a potential storage overhead, which can be reduced if one
associates timestamps with groups of elementary data items
instead of attaching a timestamp to each member of the group.
For instance, one may associate timestamps with entire records
of a file rather than associating a separate timestamp with each
field of the record. While a coarse "time stamp granularity"
implies less storage overhead, it also implies a higher update
request rejection probability and consequently a lower degree of
concurrency and higher communications cost and delay. For the
CLC protocol, the storage overhead lies in the fact that
redundant copies of the LOCK table must be stored. However,
since those tables only contain entries for active locks, we
believe that, in general, the storage overhead for the CLC
protocol is smaller than for the BVTS algorithm.

Preprocessing (Rothnie and Bernstein)

The approach to update synchronization taken in SDD-1, as
described in [ROTH 77] and [BERN 77], takes advantage of the
fact that different levels of synchronization may be required by
different types of transactions. The SDD-1 update methodology
prescribes four synchronization protocols, numbered from one to
four, with increasing degrees of synchronization. If the types
of transactions which will be run against the database are known
in advance, they can be grouped into disjoint classes of
transactions and one of the four synchronization protocols can
be assigned to each class. The mapping from transactions to
transaction classes and from transaction classes to protocols
can be constructed off line by the database administrator and
compiled into tables which will be used at transaction run-time
to select the appropriate protocol to be used.

SDD-1 consists of a collection of interconnected datamodules.
Transactions in SDD-1 are executed in two stages:

1. a transaction is first executed at its site of origin. This
   execution, called an L action (for local), generates a list
   of updates to the database. The list of updates is
   timestamped with the time at which its L action was executed
   and broadcast to all the other datamodules.

2. the second stage is the processing of the update list at a
   remote module. This processing is called a U action.

L and U actions are assumed to be atomic. However, L and U
actions of different transactions may be interleaved. The four
synchronization protocols determine, in some sense, the degree
to which L and U actions may be interleaved. The protocols can
be basically described as follows:

Protocol 1:

This is the null protocol. In other words, this protocol
provides no intermodule synchronization. It is the protocol to
be used by transactions which do not interact, in some sense,
with others.

Protocol 2:

This protocol is primarily used for retrieval transactions. A
transaction t2 is said to satisfy protocol 2 with respect to
other transactions if it either sees all or none of the updates
generated by transactions that precede it. A transaction is said
to precede another if its L action was executed before the
other, as indicated by the timestamp.

Protocol 3:

A transaction t3 is said to satisfy protocol 3 with respect to
other transactions if it sees all the updates to the database
generated by transactions which precede it.

Protocol 4:

A transaction t4 is said to satisfy protocol 4 with respect to
other transactions if all the outstanding updates are executed
before transaction t4 executes its L action and if all the
updates generated by t4 are performed at other sites before any
new transaction is introduced at those sites. This is the
strongest of the four protocols and it is the one applied to all
unanticipated transactions.
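
The run-time selection of a protocol from precompiled tables can
be pictured with the sketch below. It is not taken from SDD-1: the
table contents, the transaction type names, the class names and the
function name are all hypothetical. The only behavior drawn from
the description above is that a transaction type is mapped to a
class, a class to one of the four protocols, and any unanticipated
transaction falls back to protocol 4.

```python
# Hypothetical tables, as a database administrator might compile them off line.
TRANSACTION_CLASS = {
    "audit_report": "read_only_reports",
    "post_payment": "ledger_updates",
}
CLASS_PROTOCOL = {
    "read_only_reports": 2,
    "ledger_updates": 3,
}


def select_protocol(transaction_type):
    """Return the protocol number (1 to 4) to use for a transaction type."""
    transaction_class = TRANSACTION_CLASS.get(transaction_type)
    if transaction_class is None:
        return 4                     # unanticipated transaction: strongest protocol
    return CLASS_PROTOCOL[transaction_class]
```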

In the particular and interesting case in which the types of
transactions are known in advance, the SDD-1 solution may
represent significant savings in communications cost and delay.
In the general case, however, the expensive protocol 4 would
have to be used. For this more general case we advocate the use
of the solution prescribed by the CLC protocol, since it
addresses, among other things, the issue of robustness and crash
recovery not covered by the SDD-1 approach.

Sequential Ordering (Ellis)

In [ELLI 77] a solution to the update synchronization problem,
called the ring structured solution, using a sequential
propagation of synchronization and update messages is presented.
The nodes of the network are ordered in a circular fashion and
any node only receives messages from its predecessor and only
sends messages to its successor. An update request entered at a
given node circulates through the set of database nodes. No
updates are performed during this cycle. An update request u1
can be temporarily removed from the cycle by any node j if there
is an outstanding update request, u2, generated at node j, of
higher priority than u1 and which conflicts with it. Previously
removed requests are released by site j when all outstanding
locally generated updates have been performed at all the copies
of the database. The return of the update request to its source
node after its first round trip indicates its acceptance by all
the other nodes and generates a "perform update" message. This
message circulates through all the nodes causing the update to
be actually performed. No new updates can be introduced at a
site i if it knows of the existence of any outstanding update
requests which have not been completed yet.

The above solution preserves mutual and internal consistency by
enforcing that updates are applied to all the copies of the
database in the same order. This approach is somewhat
restrictive and exhibits a low degree of concurrency. The
communications cost for this solution is low (2*n*M), at the
expense of lengthy communication delays introduced by the serial
propagation of messages. A formal verification has been carried
out by Ellis [ELLI 77] to show the correctness of the ring
structured solution. However,
no  indication  was  given  as  to  how  robustness  and  error 


The requesting site (the preferred read site) waits until it
receives an acknowledgment message (ACK) for the REQ message
from all the sites in the network. These ACKs may contain
updates which must be forwarded by the requesting site to each
of the other read sites. These updates are placed into Read
Command messages sent from the preferred read site to the
remaining read sites. Therefore, the write part of a transaction
"piggybacks" onto the read part of subsequent transactions.

Upon receipt of a Read Command, a read site performs the
updates in the order of their timestamps and then executes the
read operation for the transaction in question. Therefore, since
the REQ message was sent to all sites in the network to collect
all outstanding updates (it is crucial that all the sites be
involved at this step) and since the read sites apply all the
updates before reading, it follows that no read operation takes
place until all relevant write operations for the site of the
read have been carried out. It is this property of the protocol
which is important in preserving the consistency of the DB.

The performance of this protocol, however, degrades seriously
with the size of the network. In particular, the overhead
communications cost to do an update under this protocol is
approximately (2*n - 1) messages, where n is the number of sites
in the network and not the number of parti-

Other proposed schemes, called primary copy strategies, have
been suggested in [BUNC 75], [GRAP 76] and [ALSB 76]. Alsberg in
[ALSB 76] introduced some techniques aimed at providing a
certain degree of resiliency to the single primary, multiple
backup strategies discussed in [BUNC 75] and [GRAP 76]. The
primary copy scheme is primarily designed to maintain mutual
consistency of databases subject to somewhat limited types of
update operations, and it does not explicitly address the
problem of internal consistency of a DDB subject to transactions
of the more general type supported by the CLC protocol.

1-5.  -  Conclusion 

The first part of this chapter outlines what we believe to be a
fairly general solution to synchronization issues in distributed
systems in the face of asynchronous unplanned failures. The
algorithms and protocols for normal operation and recovery are
robust with respect to the criteria set up at the beginning of
section 3-2. We are unaware of any other synchronization
protocols which simultaneously satisfy each of those
requirements.


The  work  is  primarily  suitable  for  environments  in 
which  the  cost,  including  delay,  of  sending  messages  is  not 
high  relative  to  the  operations  which  are  to  be  performed 
once  locking  is  complete.  Locally  distributed  systems  often 
provide  examples  of  such  an  environment.  Geographically 
distributed  networks  also  fall  into  this  category  if  the 
amount  of  work  to  be  performed  after  locking  is  significant 
relative  to  the  communications  cost. 

The protocols are also best suited for usage behavior that
cannot be directly characterized in advance. It is assumed that
query and update activity will be largely ad hoc in nature, the
more general case which has been receiving increasing attention
in recent years.

The presentation of any substantial protocol would not be
complete without an outline of a proof that the protocol is
correct with respect to its desired properties. A significant
portion of this document is therefore devoted to that purpose.
In conclusion, these protocols should help demonstrate the
practicality of integrated cooperation of activities in
distributed systems.

The  second  part  of  this  chapter  presents  two  solutions 
to  the  problem  of  deadlock  detection  in  distributed  data¬ 
bases.  The  first  solution  consists  of  a  hierarchy  of  lock 
controllers  and  is  intended  to  achieve  better  performance, 
in  terms  of  communications  cost,  than  a  centralized  ap- 


tics of these protocols is the subject of further work.

CHAPTER 4


A  FORMAL  MODEL  OF  CRASH  RECOVERY  IN  COMPUTER  SYSTEMS 


4.1  - 


This chapter is concerned with issues of crash recovery in
computer systems. In particular it addresses the following
problem. Given a set of objects (processes, files, etc.), a set
of snapshots for each one of them and the records of information
flow between them, find the set of objects and their snapshots
which should be used to restore the state of the system to a
consistent state, once an error in any of the (potentially
interacting) objects is detected. A formal model of crash
recovery, in the form of a graph, is introduced in section 4.
Several important results regarding the properties of the graph
are stated and proved. Among these results we have:

a. a specially defined cutset of the graph, called a crash set,
   identifies the objects and their snapshots which should be
   used for consistent state restoration.

b. all the snapshots in any directed cycle of the graph are
   useless and can therefore be discarded since they can never
   participate in a crash set.

This chapter concludes with the presentation of an efficient
algorithm to find the crash set given that a set of objects are
known to have crashed.

1-2.  -  The  Crash  Recovery  Problem 

The strategy for error recovery which will be discussed in this
chapter is called backward error recovery. This technique
involves backing up one or more system components (files,
processes, etc.) to a previous state when an error is detected.
In order to illustrate what is involved in backward error
recovery, consider the following situation. Assume that a
process, P1, crashes. One would therefore like to back up P1 to
its most recent snapshot. However, all of the processes and
files which have interacted with P1 after its most recent
snapshot was taken must also be backed up. This fact may in turn
force us to choose an earlier snapshot for P1, which in turn may
bring more processes and files into the set of objects which
have to be backed up, and so on. This domino effect, as
described in [RAND 77], continues until one is able to find a
consistent set of snapshots for all processes and data objects
involved. Therefore, backward error recovery requires:


a. the existence of snapshots of files, processes and of all the
   other objects that one wishes to be able to recover.

b. the existence of an information flow detection mechanism
   between distinct processes and between processes and files,
   in order to allow the recovery software to determine which
   processes and/or files should be backed up and to which point
   in time they should be backed up.

One way to graphically represent the relationship between
snapshots of objects and information flow between them is shown
in figure 4.1. In this representation, suggested by Randell et
al. [RAND 77], each horizontal line represents a time axis for
an object and the vertical lines represent information flow
between objects. The vertical lines will be called interaction
edges hereafter. The small vertical marks in the time axis of an
object indicate the instants of time at which snapshots for that
object were generated. We will use the term historical version
instead of snapshot in the remainder of this work.

The problem of backward error recovery consists of finding a
consistent set of historical versions of all the objects
involved in a crash, restoring these objects and continuing if
possible. In the example of figure 4.1, objects b, c and d have
to be restored to the values of the historical versions b2, c2
and d1, respectively, if object b crashes.

The statement of the problem will be made more precise as some
definitions and a formal model for crash recovery are introduced
in the following sections.


4.1  -  l2JL£ 


This section introduces several basic definitions and a
necessary and sufficient condition for consistent state
restoration. The state of a system is the collection of values
of all the objects in the system.

Definition 1 (consistent state): A state of the system is said
to be consistent if it is a state that the system might have
reached through normal operation starting at a consistent state.
The initial state of a system is consistent by definition.


Definition 2 (recovery set): Let S be a set of objects and let
H(S) be a set of historical versions of the objects in S such
that there is exactly one historical version in H(S) for each
object in S. The set H(S) is said to be a recovery set if it is
the minimal* set of historical versions such that a consistent
state of the system is obtained by restoring all the objects in
S to their historical versions in H(S).


Since H(S) is defined to be minimal it must not include
historical versions of objects which need not be backed up in
order to guarantee consistent state restoration. In the example
of figure 4.1, {b2, c2, d1}, {a2, b1, c2, d1}, {c3} and {d2} are
some examples of recovery sets. Let us examine what are the
necessary and sufficient conditions for an arbitrary set of
historical versions to be a recovery set.

LEMMA 4.1: A necessary condition for an arbitrary set H = {h1,
h2, ..., hk} of historical versions of objects o1, o2, ..., ok
to be a recovery set is that for every pair of historical
versions in H there is no interaction edge which predates one
and postdates the other.

Proof: We will assume that H is a recovery set and that lemma
4.1 is false and show that we reach a contradiction. Consider a
pair of historical versions hi and hj of objects oi and oj
respectively. Assume now the existence of an interaction edge
after the historical version hi was taken but before hj was
taken. Now, if objects oi and oj were restored to their values
of hi and hj, respectively, then oj would be in a state which
would be a function of the existence of an interaction edge with
oi which never existed as far as the value hi for oi is
concerned. Therefore the values hi and hj for objects oi and oj
respectively could never have coexisted, or equivalently, the
system could never have reached such a state through normal
operation. Since restoring oi and oj to hi and hj produces an
inconsistent state, the set H is not a recovery set. We have a
contradiction which proves that lemma 4.1 must hold for the set
H to be a recovery set.

* A set X is said to be minimal with respect to property P if
no proper subset of X has property P.

A set of historical versions satisfying lemma 4.1 above will be
called consistent. In the example of figure 4.1, the set {a2,b2}
cannot be a recovery set because lemma 4.1 is not satisfied.
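
The condition of lemma 4.1 can be checked mechanically. The sketch
below is an illustration only: it assumes that each historical
version is identified by the time at which it was taken and that
each interaction edge is recorded as a triple (object, object,
time). These representations, the function name and the sample data
are hypothetical and are not taken from figure 4.1.

```python
def is_consistent(versions, edges):
    """Check the condition of lemma 4.1 for a set of historical versions.

    versions maps an object to the time its chosen version was taken.
    edges is an iterable of (object_a, object_b, time) interactions.
    """
    for a, b, t in edges:
        if a in versions and b in versions:
            earlier, later = sorted((versions[a], versions[b]))
            if earlier < t < later:   # edge postdates one version, predates the other
                return False
    return True


# Hypothetical history: a and b interacted at time 5.
edges = [("a", "b", 5)]
ok = is_consistent({"a": 3, "b": 4}, edges)       # both versions predate the edge
bad = is_consistent({"a": 3, "b": 7}, edges)      # edge lies between the versions
```

In the second call, restoring b to its version taken at time 7
would keep the effect of an interaction that a, restored to time 3,
never took part in.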

Whenever one or more objects are known to be in error, it is
necessary to find the set of all the objects which must also be
backed up because they have interacted with the objects detected
to be in error. A recovery set must therefore include historical
versions for all such objects. The following definition makes
this statement more precise.

Definition 3 (completeness of a set of historical versions):
Let H = {h1, h2, ..., hk} be a set of historical versions of the
associated objects in S = {o1, o2, ..., ok}. The set H is said
to be complete if it is the minimal set such that there is no
interaction edge e between an object oj in S and an object x not
in S such that the interaction edge e postdates the historical
version hj of object oj.

In light of the above definition we can state another
necessary condition for a set of historical versions to be
a recovery set.

LEMMA 4.2: A necessary condition for an arbitrary set H of
historical versions to be a recovery set is that it be
complete.

Proof: The argument is again by contradiction and it follows
the same line as the previous one. If we assume that H is a
recovery set and that the condition is false then there must
be at least one object x which does not have an historical
version in H such that there is an interaction edge between
x and an object oi, where the historical version hi of oi is
in H and the interaction edge postdates hi. In this case,
backing up oi to hi would generate an inconsistent state since
object x would know of an interaction with oi which never
occurred as far as hi is concerned. The minimality of H with
respect to completeness follows from the fact that H is also
minimal with respect to consistent state restoration. If H is
not minimal with respect to completeness then there is at
least one object represented in H which did not interact with
any of the other objects represented in H after the historical
versions in H were taken. Therefore this object need not be
backed up and H is not minimal with respect to consistent
state restoration. This is a contradiction and proves the
point.


Given the example of figure 4.1, the set {b2, c2} cannot
be a recovery set because it is not complete.
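
The closure part of the completeness definition can be illustrated in the same way. The sketch below is an illustration only, under the same assumed representation (versions as times, interaction edges as triples). It lists the interaction edges that rule a set out, and it deliberately does not test the minimality requirement of the definition.

```python
def completeness_violations(versions, interactions):
    """Return the interaction edges that leave the set incomplete.

    versions:     {object: time at which its chosen historical version was taken}
    interactions: iterable of (object_a, object_b, time) interaction edges
    A violation is an edge between an object in the set and an object outside
    it that postdates the chosen version of the object in the set.
    """
    violations = []
    for a, b, t in interactions:
        for inside, outside in ((a, b), (b, a)):
            if inside in versions and outside not in versions and t > versions[inside]:
                violations.append((inside, outside, t))
    return violations


# Hypothetical data: b interacted with c after b's chosen version was taken,
# so a set holding only a version of b cannot be a recovery set.
print(completeness_violations({"b": 2}, [("b", "c", 5)]))
```

An empty result means that no further object has to be backed up; whether the set also contains unnecessary objects is a separate question.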


If a set of historical versions is both complete and
satisfies lemma 4.1 then a consistent state is achieved if
this set of historical versions is used to recover the
associated objects.


THEOREM 4.1: A necessary and sufficient condition for an
arbitrary set H = {h1, h2, ..., hk} of historical versions of
objects o1, o2, ..., ok to be a recovery set is that it be
complete and consistent.


Proof: The necessity of the condition has been shown
already, therefore we only need to show the sufficiency. Let
H = {h1, h2, ..., hk} be a set of historical versions of
objects o1, o2, ..., ok, such that H satisfies lemmas 4.1 and
4.2. The completeness of H guarantees that it is not necessary
to back up any other object because objects o1 through ok are
backed up. The completeness of H, by its minimality, also
guarantees that no unnecessary object will be backed up,
therefore guaranteeing that H is minimal with respect to
consistent state restoration. Lemma 4.1 guarantees that the
historical versions in H could have coexisted as values of
objects o1 through ok. Therefore, backing up according to the
recovery set H generates a consistent state for the system.


There are several recovery sets to which the system can
be backed up when a set of objects is detected to be in
error. From these sets we want to select the one that
minimizes the amount of backing up necessary. This recovery
set is called the crash set.


Definition (crash set of an object): Given an object x
detected to be in error, the crash set of x, denoted
crash_set(x), is the latest possible recovery set which
contains an historical version of the object. By latest
possible we mean the one which contains the most recent
possible historical version of the objects in the set.


From the example of figure 4.1, we have that
crash_set(a) = {a3}, crash_set(b) = {a3, b2, c2, d2},
crash_set(c) = {c3} and crash_set(d) = {d2}.
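
One way to compute such a set is to start from the most recent historical version of the faulty object and to propagate the backing up along interaction edges until nothing changes. The sketch below is an illustration under stated assumptions and is not the procedure of the report: versions and interaction edges carry hypothetical timestamps, every object is assumed to have an initial historical version that predates all its interactions, and the data shown is invented rather than taken from figure 4.1.

```python
from bisect import bisect_left


def crash_set(x, history, interactions):
    """Latest recovery set containing a version of x, by fixpoint propagation.

    history:      {object: [(time, version_name), ...]} sorted by time
    interactions: list of (object_a, object_b, time) interaction edges
    """
    times = {obj: [t for t, _ in versions] for obj, versions in history.items()}
    chosen = {x: len(history[x]) - 1}  # index of the version chosen per object
    changed = True
    while changed:
        changed = False
        for a, b, t in interactions:
            for src, dst in ((a, b), (b, a)):
                if src not in chosen or t <= times[src][chosen[src]]:
                    continue
                # The edge postdates the chosen version of src, so dst must be
                # restored to its latest version taken strictly before the edge.
                idx = bisect_left(times[dst], t) - 1
                if idx < 0:
                    raise ValueError(f"{dst} has no version before time {t}")
                if dst not in chosen or idx < chosen[dst]:
                    chosen[dst] = idx
                    changed = True
    return {history[obj][i][1] for obj, i in chosen.items()}


# Hypothetical history: three objects, two versions each.
history = {
    "a": [(0, "a1"), (4, "a2")],
    "b": [(0, "b1"), (2, "b2")],
    "c": [(0, "c1"), (6, "c2")],
}
interactions = [("a", "b", 3), ("b", "c", 5)]
print(sorted(crash_set("a", history, interactions)))
print(sorted(crash_set("b", history, interactions)))
```

Because chosen versions only ever move backwards in time, the loop terminates. In this example backing up b drags a and c back to their initial versions, while a on its own needs no other object.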


The above definition of crash set of an object will be
generalized to the notion of a crash set of a set of objects
in section 4.7.1 after we introduce a formal model of crash
recovery.

4.4 - Formal Model of Crash Recovery

This section introduces a formal model of crash
recovery. This model is in the form of a graph called the
history graph. This graph has two types of branches, namely
directed branches and undirected ones. There is a directed
branch associated with each historical version of each
object. Branches of this type are called hv-branches (for
historical version). The nodes of the history graph have no
semantics associated with them but they serve as endpoints
of hv-branches. There is an undirected branch associated
with each interaction edge. These branches are called
ie-branches (for interaction edge). The history graph is
defined by the following rules:

a. For every object there is a directed path of
hv-branches containing a branch for every historical
version of that object. This path is such that if
we traverse it in its proper direction we will
encounter all the historical versions of that object
in chronological order.

b. Hv-branches are labeled with historical version
names. Each node in the directed path associated
with each object is labeled with the same label as
the only directed branch incident out of it, if
there is one. If the node has no outgoing directed
branches then it is labeled with the name of
the object followed by a star (*).

c. Let vi and vj be two nodes in the graph associated
with objects oi and oj respectively. Let hi and
hj be the labels of the directed branches incident
into vi and vj respectively. There is an ie-branch
between nodes vi and vj if an interaction
occurred between objects oi and oj after historical
versions hi and hj were taken but before any
other historical version for these objects was
taken.

d. We assume that there is always an initial historical
version of each object which is taken before
any interaction edge between the object and any
other object occurs.

Let us add an additional node to the history graph,
called the source node and labeled s. There is an undirected
branch from this node to the starting node of the directed
path of each object. This node is introduced to make
the graph connected*.

* A graph with directed branches is said to be connected if
in the underlying undirected version of it there is a path
between every pair of nodes.


Figure 4.2 is the history graph for the situation
illustrated in figure 4.1.

We will now examine some of the interesting properties
of the history graph. These properties will be presented as
theorems.

THEOREM 4.2: Let H be a recovery set. The set of
hv-branches associated with the elements of H is a cutset* of
the history graph.

Proof: We will first prove that the removal of the branches
in H** disconnects the history graph, which is initially
connected, and then we will prove that H is minimal with
respect to the property of being a disconnecting set. Before
proceeding to the proof some definitions are in order.

Let G be the history graph. Let W be the set of objects
associated with the historical versions in H. Let G-H
be the graph obtained from G by removing from G all the
branches in H. Let Gx be the subgraph of G-H induced by the
set of nodes X defined as follows:

X = {s} U {x | there is a directed path in G-H
from s to x}

* A cutset is a minimal collection of branches whose removal
results in a non connected graph [FRAN 71].

** For the sake of simplicity we use the expression
"branches in H" instead of the precise expression "the set
of hv-branches associated with the elements of H".

Let Y be the set of nodes of G which are not in X and
let Gy be the subgraph of G-H induced by the nodes in Y.
Figure 4.3 illustrates the above definitions. Since the
set H contains no historical versions of objects not in W,
the directed path of such objects is totally contained
in Gx. Therefore the set Y only contains nodes associated
with objects in W.

Let us prove the first part of the theorem. By definition
of the set X, there is no directed branch in G-H
between a node in X and a node in Y, otherwise the node in Y
would be in X. It remains for us to prove that there are no
undirected branches or ie-branches in G-H between a node in
X and a node in Y. There are two possible cases to consider:

Case 1: let e1 be an ie-branch between nodes in the directed
paths of objects in W.

The existence of such a branch indicates that there is
an interaction edge that predates one historical version in
H and postdates another. Lemma 4.1 is thus violated,
contradicting the assumption that H is a recovery set.
Therefore an ie-branch such as e1 cannot exist.

Case 2: let e2 be an ie-branch between a node in Y and a
node in the directed path of an object not in W.

The existence of such a branch indicates that there is
an interaction edge between an object w in W and an object z
not in W which postdates the historical version of w. This
is a violation of lemma 4.2 and contradicts the assumption
that H is a recovery set. Therefore an ie-branch such as e2
cannot exist.

Since  the  only  branches  that  connect  nodes  in  X  to 
nodes  in  Y  are  the  branches  in  H,  the  removal  of  these 
branches  disconnects  the  graph  G. 

We will now prove that no proper subset of H disconnects
G. Let us first show that Gy is connected. Assume
that Gy is not connected. Then there is at least one object
w in W which did not interact with any other object after
its historical version in H was taken. Therefore, there is
no object x in W such that there is an interaction edge
between x and w which postdates the historical version of x
in H. So, the set H is not complete, which contradicts the
assumption that H is a recovery set.

Now assume that H is not minimal. Then there is at least
one branch hi in H whose removal is not necessary to
disconnect the graph. Therefore the graph G - (H - {hi}) is
disconnected. But since the graphs Gx and Gy are connected and
since hi connects a node in Gx to a node in Gy, the graph Gx
+ Gy + {hi} is connected, which contradicts the assumption
that H is not minimal and proves its minimality.

THEOREM 4.3: Let H = {h1, h2, ..., hk} be a cutset of the
history graph which contains no ie-branches and such that
each of the branches in H corresponds to an historical
version of a distinct object. Then the set of historical
versions associated with H is a recovery set.


Proof: In order to prove this theorem we have to show that
lemmas 4.1 and 4.2 hold, since these conditions together
form a necessary and sufficient condition for H to be a
recovery set. Let us define the sets X, Y and W and the
graphs G, G-H, Gx and Gy as in the proof of theorem 4.2.
Let us first prove that Gy is connected. Assume that Gy is
not connected. That being the case, it has at least two
components. Let C be one of the components and let Gy - C
be the graph formed by the remaining components. Since G is
connected, there is at least one branch of H which goes from
a node of Gx to a node of C. The remaining branches of H
connect nodes of Gx to nodes of Gy - C. But removing only the
branches of H which connect Gx to C disconnects the graph
and therefore H is not a minimal disconnecting set. This
fact contradicts the assumption that H is a cutset and
proves that Gy is connected. Since, by assumption, H is a
cutset there is no ie-branch between a node in X and a node
in Y, otherwise removal of the branches in H would not
disconnect G. Therefore, there is no interaction edge which
predates an historical version in H and postdates another
historical version in H. So, lemma 4.1 is satisfied. Also,
there is no interaction edge between an object w in W and
an object z not in W which postdates the historical version
of w in H. In order to prove that H is complete it
only remains for us to show that there is no proper subset
H1 of H such that there is no interaction edge between an
object x1 in the set W1 of objects associated with the
historical versions in H1 and an object z not in W1 which
postdates the historical version of x1. Assume that there
is such a subset H1. Then, there is no ie-branch in Gy
between nodes of Gy associated with the elements of W1 and
nodes associated with elements not in W1. Therefore Gy is
disconnected, which is a contradiction and proves that a
subset such as H1 cannot exist. Therefore lemmas 4.1 and 4.2
are satisfied and H is a recovery set.

The  previous  two  theorems  can  be  combined  to  yield  the 
following  result. 

THEOREM 4.4: A set of historical versions H = {h1, h2, ...,
hk} is a recovery set if and only if the set of branches
associated with the historical versions in H is a cutset of
the history graph such that each branch in H corresponds to
an historical version of a distinct object (i.e. is an
hv-branch).

4.5 - Condensed History Graph

Since no cutset consisting only of hv-branches
separates two nodes connected by an ie-branch we can
collapse the graph by combining every pair of nodes connected
by an ie-branch into a single "super-node". The label of
a super-node is the union of the labels of the component
nodes. The set of branches incident into a
super-node is the union of the sets of branches incident
into the two component nodes. Similarly, the set of
branches incident out of a super-node is the union of the
sets of branches incident out of the two component nodes.
The resulting graph after all collapsing has been done is
called a condensed history graph. Of course the component
nodes may themselves be super-nodes which resulted from a
previous collapsing operation. Since this graph only
contains hv-branches we will simply use the term branch when
referring to the branches of the condensed history graph.
Figure 4.4 shows the condensed history graph for the history
graph in figure 4.2.
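
The collapsing step can be sketched with a union-find structure. The code below is an illustration only: the function name condense, the use of node labels as node identifiers and the small example graph are assumptions made here, and the example is not the graph of figure 4.2. A super-node is represented by the frozen set of the labels of its component nodes.

```python
def condense(nodes, hv_branches, ie_branches):
    """Collapse every pair of nodes joined by an ie-branch into a super-node.

    nodes:       iterable of node labels
    hv_branches: list of (version_label, tail_node, head_node)
    ie_branches: list of (node, node) undirected branches
    Returns the hv-branches re-attached to super-nodes.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Find the representative of n, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in ie_branches:
        parent[find(u)] = find(v)

    groups = {}
    for n in parent:
        groups.setdefault(find(n), set()).add(n)
    super_of = {n: frozenset(groups[find(n)]) for n in parent}
    return [(label, super_of[t], super_of[h]) for label, t, h in hv_branches]


# Hypothetical history graph for two objects a and b (source node omitted).
nodes = ["a1", "a2", "a*", "b1", "b2", "b*"]
hv = [("a1", "a1", "a2"), ("a2", "a2", "a*"),
      ("b1", "b1", "b2"), ("b2", "b2", "b*")]
ie = [("a2", "b2"), ("a*", "b*")]
for branch in condense(nodes, hv, ie):
    print(branch)
```

In this example branches a2 and b2 end up with the same pair of endpoints; this sketch leaves such branches in place rather than combining them.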

The collapsing operation may create parallel* branches
or self-loops** in the graph. Any cutset of a graph
contains either all the parallel branches between two nodes or
none of them. Therefore we can combine all the parallel
branches into a single branch labeled with the set of labels
of the branches being combined. Also, since a self-loop
will not appear in any cutset of a graph it can be eliminated
from the graph. This fact has the implication that the
historical version associated with the self-loop is useless
as far as crash recovery is concerned. Consequently it need
not be stored anymore and can be discarded. As can be seen
in figure 4.4, branches a2 and b1 have been combined into a
single branch since otherwise they would be parallel branches.
Also, historical version c1 disappeared, since otherwise it
would be a self-loop of the graph. This result can be
generalized as follows:

* Two or more branches of a graph are said to be parallel if
they have the same endpoints.

** A self-loop is a branch which is incident out of the same
node into which it is incident.

THEOREM 4.5: Let C be a directed cycle in the history graph
in the sense that all hv-branches in the cycle are in the
same direction if we traverse the cycle. The historical
versions associated with the hv-branches in the cycle
are useless and can be discarded.

Proof: We are going to prove that no hv-branch in C can
belong to a cutset of the history graph containing only
hv-branches and such that all the branches in the cutset are
associated with distinct objects. It follows that such
hv-branches cannot be part of any recovery set and represent
useless historical versions which can be discarded. Let H =
{h1, h2, ..., hk} be a cutset of the history graph G such
that each branch in the cutset is an hv-branch and is
associated with a distinct object. Let the sets X and Y and the
graphs Gx and Gy be defined as in the proof of theorem 4.2.
Let us first observe that all the branches in H are incident
out of nodes in X and incident into nodes in Y. This
follows from the definition of the set X, since the node s
belongs to the set X and all the initial nodes of branches in
H can be reached via a directed path from s. Now, let the
hv-branch hi belong to a directed cycle C and also to the
cutset H. Let us now traverse the cycle C. If we go from X
to Y by traversing the branch hi we must eventually return
to X by traversing a branch e. The branch e must be in the
cutset H, otherwise there would be a path between a node in X
and a node in Y which contains e and G-H would not be
disconnected. The branch e cannot be an hv-branch, otherwise
it would be incident into a node of X, which is impossible as
observed above. Also, the branch e cannot be an ie-branch
since by assumption H only contains hv-branches. In
conclusion, it is not possible for an hv-branch to belong to the
directed cycle C and to the cutset H.

Let us observe that when we collapse nodes of a history
graph into super-nodes we are essentially hiding all the
ie-branches "inside" the super-nodes. So, a directed cycle
in the history graph appears in the condensed history graph
as a directed cycle containing only hv-branches. There is a
one to one correspondence between cycles in the history
graph and its condensed version. We have therefore the
following result.

THEOREM 4.6: Let C be a directed cycle* in the condensed
history graph. The historical versions associated
with the hv-branches in the cycle are useless and can be
discarded.
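
A branch lies on a directed cycle exactly when its tail can be reached from its head by following branches in their proper direction, which suggests a simple reachability test. The sketch below is an illustration and not a routine from the report; the representation of the condensed history graph as a dictionary from (tail, head) pairs to sets of version labels, the function name and the example graph are all assumptions.

```python
def useless_versions(branches):
    """Return the labels of all branches that lie on a directed cycle.

    branches: {(tail, head): set of historical version labels}
    A self-loop (tail == head) is treated as a degenerate cycle.
    """
    successors = {}
    for tail, head in branches:
        successors.setdefault(tail, set()).add(head)

    def reachable(start, goal):
        # Iterative depth-first search along directed branches.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(successors.get(node, ()))
        return False

    useless = set()
    for (tail, head), labels in branches.items():
        if reachable(head, tail):
            useless |= labels
    return useless


# Hypothetical condensed graph: nodes 1, 2, 3 form a directed cycle.
example = {(1, 2): {"x1"}, (2, 3): {"y1"}, (3, 1): {"z1"}, (3, 4): {"z2"}}
print(sorted(useless_versions(example)))
```

For large graphs a strongly connected components algorithm would avoid the repeated searches; the simple form is kept here for readability.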


Another interesting property of the collapsed history
graph is that we do not need to store individual interaction
edges anymore once their effect on the graph has been taken
into account. The importance of this result can be better
appreciated if one makes the following observations:

a. If a crash destroys an historical version of an
object then an earlier crash set may have to be
used to recover from a crash of that object. However,
the ability to recover to a consistent state
is not compromised by the accidental or intentional
loss of an historical version.

b. If any interaction edge is lost then correct crash
recovery may not be possible any more since some
of the crash sets may ignore the existence of
interaction edges which actually occurred.


* We will consider a self-loop as a degenerate case of a
cycle.

Observation b above implies that interaction edges must
be reliably stored. This problem gets more complex because
the number of interaction edges generated per unit of time
can be fairly large. One can expect thousands of interaction
edges to be generated during one day of operation of a
moderately loaded computer system. Therefore, if we had to
store all of the interaction edges we would have to have a
sort of file system for that purpose. This file system
should be designed to be extremely reliable. We would also
have the problem of garbage collecting interaction edges and
retrieving them when needed for crash set calculation.
Since interaction edges are crucial for correct crash
recovery, extreme care has to be taken in deciding which
ones can be discarded.

All  of  the  above  problems  associated  with  having  to 
store  interaction  edges  have  been  solved  by  the  introduction 
of  the  collapsed  history  graph. 

A relationship between recovery sets and cutsets of the
condensed history graph, similar to the result of
theorem 4.4, can also be proved. To this end we
prove the following two lemmas.

LEMMA 4.3: Let H = {h1, h2, ..., hk} be a cutset of the
history graph G containing no ie-branches and such that each
branch in H is associated with an historical version of a
distinct object. Let Hc = {c1, c2, ..., cm} be the set of
branches in the condensed version of G, Gc, associated with
the branches in H. The set Hc is a cutset of the graph Gc
such that the labels on the branches in Hc are associated
with historical versions of distinct objects.

Proof: Let the sets X and Y and the graphs G, G-H, Gx and Gy
be defined as in the proof of theorem 4.2. Let Gxc be the
condensed version of Gx and let Gyc be the condensed version
of Gy. Since H is a cutset with no ie-branches, there is no
path containing only ie-branches between the endpoints of
any branch in H. Therefore, the endpoints of such branches
are not part of the same super-node in the graph Gc.
Therefore, the associated branches in Hc are all incident out
of nodes of Gxc and incident into nodes of Gyc.

Let us show now that the removal of the branches in Hc
disconnects Gc by breaking all paths from Gxc to Gyc. Let
us recall that Gx is connected by definition and that Gy was
shown to be connected in the proof of theorem 4.3. Since
the operation of collapsing nodes does not remove any
branch, the graphs Gxc and Gyc are also connected. Since
the branches in Hc are the only ones which connect nodes in
Gxc to nodes in Gyc, the removal of such branches
disconnects the graph Gc by breaking all the paths between nodes
in Gxc and nodes in Gyc. This set is the minimal set with
such a property, since any branch in Hc is part of a path
between Gxc and Gyc and must therefore be removed if Gc is
to be disconnected.


LEMMA 4.4: Let Hc be a cutset of the condensed history graph
Gc such that the labels on the branches in Hc are associated
with historical versions of distinct objects. Let H be the
set of branches of the history graph, G, associated with the
branches in Hc. The set H is a cutset of the graph G with
no ie-branches and such that the branches in H are associated
with historical versions of distinct objects.


Proof: Let Gxc and Gyc be the two components into which the
graph Gc is separated when the branches in Hc are removed
from it. Let the graphs Gx and Gy be the uncondensed
versions of the graphs Gxc and Gyc respectively. In order to
obtain the uncondensed version of a condensed graph Gc we
must replace every super-node in Gc by the connected
subgraph in G, consisting of ie-branches only, which
corresponds to the super-node. If Gc is connected so is G.
Also, the endnodes of any ie-branch in G are combined into
the same super-node of Gc. Therefore, there is no ie-branch
connecting a node in Gx to a node in Gy. Otherwise, assume
that there was such a branch. Its endpoints would be
collapsed into the same super-node which would be both in Gxc
and in Gyc. But this contradicts the assumption that Gxc
and Gyc are two components of Gc. Therefore, all the
branches connecting nodes in Gx to nodes in Gy are those in
H. It follows that the removal of these branches
disconnects G. Since Gx and Gy are connected this set is also
minimal. Otherwise assume that H is not minimal. Then
there is at least one branch hi in H whose removal is not
necessary to disconnect the graph. Therefore G - (H - {hi}) is
disconnected. But since there is a path from every node in
Gx to every node in Gy passing through hi, the graph Gx + Gy
+ {hi} is connected. This fact is a contradiction and
proves that H is minimal. Finally, the fact that the
branches in H correspond to historical versions of distinct
objects follows directly from the fact that the labels on the
branches of Hc also correspond to historical versions of
distinct objects. This concludes the proof.

Given lemmas 4.3 and 4.4 we can now state and prove the
following result.

THEOREM 4.7: An arbitrary set H = {h1, h2, ..., hk} of
historical versions of distinct objects is a recovery set if
and only if the set of branches associated with the historical
versions in H is a cutset of the condensed history graph
such that the labels on the branches of H correspond to
historical versions of distinct objects.

Proof: The proof of this theorem is rather simple given
theorem 4.4 and lemmas 4.3 and 4.4. If H is a recovery set,
then by theorem 4.4 the set of branches associated with the
historical versions in H is a cutset of the history graph.
But by lemma 4.3 the set of branches associated with the
cutset in the history graph is a cutset in the condensed
history graph. To show that the theorem is valid in the
other direction we must use lemma 4.4 and theorem 4.4.


Both the history graph and the condensed history graph are
dynamic in the sense that they change as new historical
versions are generated, as interaction edges occur or as
historical versions are intentionally discarded or lost due to
crashes. The next section describes which operations have
to be performed on the graph in order to model the above
described events.

Graph


There are three operations that can be performed on the
condensed history graph that are of interest to us, namely:

a. addition of a new historical version.

b. addition of an interaction edge.

c. intentional or accidental loss of an historical
version.


Adding a new historical version is a simple operation.
The addition of an interaction edge causes two nodes of the
graph to be collapsed into a single node. The loss of an
historical version causes the endnodes of that historical
version to be collapsed into a single node. Each time that
two nodes of a directed graph are collapsed there is a
potential for the formation of directed cycles or self-loops.
For instance, if in the graph of figure 4.5.a nodes 1 and 4
are collapsed into a single node we get a directed cycle as
indicated in figure 4.5.b. Also, if nodes 5 and 6 are
collapsed into a single node we get a self-loop.

This observation becomes important because, as shown in
theorem 4.6, self-loops or branches which belong to directed
cycles in the condensed history graph represent useless
historical versions. These historical versions can be
discarded because they cannot belong to any recovery set. It is
necessary to introduce some notation before giving an
algorithmic description of the operations mentioned above. Let
label(v) be the label of a node v. Let S:v -> w denote a
directed branch incident out of node v and incident into
node w and labeled with the set S. Let lastnode(x) be the
node in the graph whose label contains x*. Let into(v) be
the set of branches incident into node v and let out(v) be
the set of branches incident out of v.


Figure 4.5 - Example of Generation of
Cycles and Self-Loops (fig. 4.5.b) due
to Collapsing of Nodes in Graph of
fig. 4.5.a.


4.6.1 - Adding a new Historical Version

The following steps must be taken in order to add to
the condensed history graph Gc the historical version xi of
object x.

S1 - Add branch {xi}:lastnode(x) -> w, where w is a node
not in Gc.

S2 - label(lastnode(x)) <- (label(lastnode(x)) - {x*}) U
{xi}

S3 - label(w) <- {x*}
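
Steps S1 through S3 can be sketched as follows. The class CondensedGraph, its integer node identifiers and its method names are assumptions introduced for this illustration; only the three steps themselves come from the text. Note that lastnode(x) is looked up before the new node w receives the label x*, since afterwards w is the node whose label contains x*.

```python
import itertools


class CondensedGraph:
    """Minimal condensed history graph: node labels plus labeled branches."""

    def __init__(self):
        self.label = {}        # node id -> set of label strings
        self.branches = []     # list of (label_set, tail, head)
        self._ids = itertools.count()

    def new_node(self, labels):
        node = next(self._ids)
        self.label[node] = set(labels)
        return node

    def lastnode(self, x):
        # The node whose label contains the starred name of object x.
        star = x + "*"
        return next(n for n, lab in self.label.items() if star in lab)

    def add_version(self, x, xi):
        star = x + "*"
        v = self.lastnode(x)
        w = self.new_node(())                        # w is a node not in Gc
        self.branches.append(({xi}, v, w))           # S1
        self.label[v] = (self.label[v] - {star}) | {xi}   # S2
        self.label[w] = {star}                       # S3
        return w


# Hypothetical usage: object x starts with a single node labeled x*.
g = CondensedGraph()
start = g.new_node({"x*"})
w = g.add_version("x", "x1")
print(g.label, g.branches)
```

After the call the old last node carries the label x1, matching the rule that a node is labeled like the branch incident out of it.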

4.6.2 - Adding an Interaction Edge and Discarding Historical
Versions

These two operations are very similar in terms of what
happens to the graph. They both require a collapse routine
to collapse two nodes into a single node. In the case of an
addition of an interaction edge the two nodes to be
collapsed are the two lastnodes of the objects between which
the interaction edge occurred. In the case of discarding an
historical version, the nodes to be collapsed are the ones
which are adjacent through the branch which represents the
historical version to be discarded. Once two nodes of a
directed graph are collapsed it is necessary to check for
the existence of directed cycles or self-loops which might
have been generated by the collapsing operation. These
cycles must be eliminated using the cycle_elimination routine.
We describe now each of the above operations.


Collapse Routine


The following steps must be taken in order to collapse nodes v and
w in the condensed graph Gc.


S1 - Replace nodes v and w by a single node z such that
     label(z) <- label(v) ∪ label(w)
     into(z) <- into(v) ∪ into(w)
     out(z) <- out(v) ∪ out(w)


S2 - [check for parallel branches] Let

     Y(y,z) = { Si:y->z | y is a node of Gc and y ≠ z }

     X(y,z) = { Si:z->y | y is a node of Gc and y ≠ z }.

     For all nodes y in Gc such that |Y(y,z)| > 1
     replace all the branches in Y(y,z)
     by the single branch S:y -> z
     where S is the union of all the Si's.

     For all nodes y in Gc such that |X(y,z)| > 1
     replace all the branches in X(y,z)
     by the single branch S:z -> y
     where S is the union of all the Si's.


S3 - [check for self-loops] Let D = into(z) ∩ out(z).
     If D ≠ ∅ then discard all historical versions represented
     by the branches in D.


S4 - [check for directed cycles] GO TO the
     cycle_elimination routine.
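
A minimal Python sketch of steps S1 to S3 is given below. It assumes a hypothetical representation in which labels are held in a dictionary and each branch is a triple (tail, head, frozenset of historical versions). The merged node z simply reuses the identifier of v. The sketch merges every group of parallel branches it finds, which coincides with step S2 when the graph had no parallel branches before the collapse. Step S4 is not included here.

```python
def collapse(label, branches, v, w):
    """Collapse nodes v and w; return (label, branches, discarded versions)."""
    z = v                                            # z reuses the identifier of v
    new_label = {n: set(lab) for n, lab in label.items() if n != w}
    new_label[z] = label[v] | label[w]               # S1: label(z)

    merged = {}                                      # (tail, head) -> versions
    discarded = set()
    for tail, head, versions in branches:
        tail = z if tail in (v, w) else tail         # S1: redirect into/out sets
        head = z if head in (v, w) else head
        if tail == head:                             # S3: self-loop, useless versions
            discarded |= versions
        else:                                        # S2: merge parallel branches
            merged[(tail, head)] = merged.get((tail, head), frozenset()) | versions

    new_branches = [(t, h, s) for (t, h), s in merged.items()]
    return new_label, new_branches, discarded


label = {4: {"p"}, 5: {"q"}, 6: {"r"}}
branches = [(4, 5, frozenset({"y1"})),
            (4, 6, frozenset({"y2"})),
            (5, 6, frozenset({"x1"}))]
new_label, new_branches, discarded = collapse(label, branches, 5, 6)
```

In this small example the branch from 5 to 6 becomes a self-loop, so its version x1 is discarded, and the two branches out of node 4 become parallel and are merged into one branch labeled {y1, y2}.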


Cycle Elimination Routine


The cycle elimination routine is used to eliminate all directed
cycles from the graph Gc. Such cycles are eliminated by collapsing
the endpoints of each branch in the cycle into a single node and
deleting the branches from the graph. In other words, this
operation discards all the historical versions represented by the
branches in the cycle, since they are useless. We are therefore
interested in finding the set of all branches of Gc which belong
to a directed cycle. But this is precisely the set of branches
which belong to the strongly connected components of the graph. A
strongly connected component of a graph G is a maximal subgraph of
G such that there is a directed cycle containing every pair of
nodes in the component. A very efficient algorithm to find all the
strongly connected components of a directed graph is given by
Tarjan in [TARJ 72]. His algorithm requires O(n,m) space and
time*, where n is the number of nodes in the graph and m is the
number of branches.


* An algorithm is said to require O(n,m) space and time if its
space and time requirements are bounded by k1*n + k2*m + k3 for
some constants k1, k2 and k3.


The cycle_elimination routine is then the following.

S1 - Find all strongly connected components of Gc.

S2 - For each strongly connected component collapse all its nodes
     into a single node and delete all of its branches.
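
The two steps can be sketched in Python as shown below. The sketch uses a recursive form of Tarjan's strongly connected components algorithm and the same hypothetical branch representation as a triple (tail, head, frozenset of versions). Choosing the smallest node identifier as the representative of a component, and merging parallel branches that appear after the collapse, are assumptions made for the illustration.

```python
def strongly_connected_components(nodes, branches):
    succ = {n: [] for n in nodes}
    for tail, head, _ in branches:
        succ[tail].append(head)
    index, low, on_stack, stack, comps = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)        # depth first search number
        stack.append(v)
        on_stack.add(v)
        for w in succ[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:                # v is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            comps.append(comp)

    for n in nodes:
        if n not in index:
            visit(n)
    return comps


def eliminate_cycles(label, branches):
    comps = strongly_connected_components(list(label), branches)   # S1
    rep = {n: min(c) for c in comps for n in c}   # representative per component
    new_label = {}
    for n, lab in label.items():                  # S2: collapse each component
        new_label.setdefault(rep[n], set()).update(lab)
    merged, discarded = {}, set()
    for tail, head, versions in branches:
        t, h = rep[tail], rep[head]
        if t == h:                                # S2: branch inside a component
            discarded |= versions
        else:
            merged[(t, h)] = merged.get((t, h), frozenset()) | versions
    return new_label, [(t, h, s) for (t, h), s in merged.items()], discarded


label = {1: {"p"}, 2: {"q"}, 3: {"r"}, 4: {"s"}}
branches = [(1, 2, frozenset({"a1"})), (2, 3, frozenset({"b1"})),
            (3, 1, frozenset({"c1"})), (3, 4, frozenset({"d1"}))]
new_label, new_branches, discarded = eliminate_cycles(label, branches)
```

Here nodes 1, 2 and 3 form a directed cycle, so they are collapsed into one node and the versions a1, b1 and c1 are discarded; the branch labeled {d1} survives.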


4.7 - Crash Set Calculation

Let us first characterize the crash set of an object in terms of
the condensed history graph. This characterization will lead to a
simple and efficient algorithm for finding the crash set of an
object. Some definitions are in order. Let G*(a) be a graph
associated with object a such that the node set, V, of G*(a) is
defined as

     V = { w | there is a directed path from lastnode(a)
               to w in Gc }

The graph G*(a) can now be defined as the subgraph of Gc induced
by the set of nodes V. Figure 4.6 shows the graph G*(b) for the
condensed history graph illustrated in figure 4.4. Let us observe
that the graph G*(a) for any object a is a connected graph since,
by definition, there is a directed path from lastnode(a) to every
node in the graph. We are now ready to prove the following result
for condensed history graphs.


THEOREM 4.8: Let H = {h1, h2, ..., hk} be the set of all branches
of a condensed history graph Gc which are not in G*(a) and which
are incident into nodes of G*(a) for an object a. Then,
crash_set(a) = H.

Proof: The set H is clearly a cutset of Gc since removal of the
branches in H separates G*(a) from the rest of the graph Gc. Since
Gc and G*(a) are connected the set H is the minimal set which
disconnects Gc. We now have to prove that the branches in H
correspond to historical versions of distinct objects. Assume that
there are two branches hi and hj in H which correspond to
historical versions of the same object. Assume also, without loss
of generality, that hi precedes hj. Then there must be a directed
path P in Gc which traverses hi and hj in this order. Let w be the
node in G*(a) into which hi is incident. Since w is in G*(a),
there is a directed path from lastnode(a) to w. Concatenated with
the portion of P starting at w, it gives a directed path from
lastnode(a) to both endpoints of hj. Therefore hj is in G*(a) and
cannot be in H. This shows that it is impossible for both hi and
hj to belong to the set H.

So far we have proved that H is a cutset such that the branches in
H correspond to historical versions of distinct objects. By
theorem 4.7, such a set is a recovery set.

[...] as the branches of H. Let the most recent historical version
of object a be represented by the branch e and let e ∈ H1. Let hj
and not hi be in H1. Let x be the node into which hi is incident.
Let G1 and G2 be the two components of G - H1. Let the node s be
in G1.

Assume that a cutset such as H1 exists. Since hi ∈ H, node x is in
G*(a). Therefore, there is a directed path P1 in Gc from
lastnode(a) into node x. But since lastnode(a) and node x are in
separate components of G - H1, at least one branch of P1 must be
in H1. This branch must be incident out of a node of G2 and
incident into a node of G1. But this is not possible due to Lemma
4.5, proved below. Therefore it is not possible for a cutset such
as H1 to exist, which proves statement ii) and completes the proof
of the theorem.


LEMMA 4.5: Let Gc be a condensed history graph. Let H be a cutset
of Gc associated with a recovery set and let G1 and G2 be the two
components of G - H. Let node s be in G1. Then all the branches in
H are incident out of nodes of G1 and are incident into nodes of
G2.

Proof: The proof is by contradiction. Assume that there is at
least one branch h ∈ H which is incident out of a node of G2 and
is incident into a node of G1. Recall that there is a directed
path in Gc associated with every object and starting at node s.
Consider the path which contains h. Since it starts in G1 and h is
incident out of a node of G2, the path must first cross from G1
into G2, so it contains at least two branches of H, one in each
direction. Hence, there are at least two branches of H associated
with historical versions of the same object. This contradicts the
assumption that H is a recovery set and proves the lemma.


4.7.1 - Algorithm for Crash Set Calculation

The algorithm to calculate the crash set of an object a consists
in finding the graph G*(a) and then finding the set of branches of
the condensed history graph, Gc, which are not in G*(a) and that
are incident into nodes of G*(a). The graph G*(a) can be easily
found if we apply a graph traversal procedure, such as the depth
first search described in [TARJ 72], to the graph Gc starting at
the node lastnode(a). This procedure will give us a spanning tree
for the graph G*(a). The depth first search procedure is O(n,m)
and it visits each node of the graph exactly once, assigning a
number to it. Before we describe the algorithm let us define the
following set associated with a given node v.


     S(v,a) = { v->w | w ∈ G*(a) }


This is the set of all branches incident out of node v and
incident into any node in G*(a). The algorithm is implemented by
the procedure CRASHSET given below in Algol-like notation. It uses
a depth first search procedure, called DFS, to traverse the graph
G*(a) starting at node lastnode(a). As a result of the DFS
procedure we get the array nodeset which indicates the node set of
G*(a): a node j of Gc is in G*(a) if nodeset[j] = true. Also, at
the end of the DFS procedure we have a collection of branches
incident into nodes of G*(a). Some of these branches are in the
graph G*(a) itself; these are the ones in S(v,a) for nodes v in
G*(a). Finally we take the union of the sets S(v,a) for all nodes
v not in G*(a). This union is equal to crash_set(a) by theorem
4.8.


Since the depth first search procedure is O(n,m) the CRASHSET
procedure is also O(n,m).


Let us now generalize the notion of crash set of an object to that
of crash set of a set of objects.


PROCEDURE CRASHSET(a);
BEGIN
   INTEGER i;
   BOOLEAN ARRAY nodeset[1:n];

   PROCEDURE DFS(v,u); COMMENT node u is the father of node v
                       in the spanning tree of G*(a);
   BEGIN
      number[v] := i := i + 1;
      nodeset[v] := true;
      FOR EACH x -> v ∈ into(v)
         DO S(x,a) := S(x,a) ∪ {x -> v};
      FOR w in the adjacency list of v DO
         IF w is not numbered THEN DFS(w,v);
   END;

   COMMENT initialization phase;
   i := 0; nodeset := false; crash_set(a) := ∅;
   FOR EACH node x in Gc DO S(x,a) := ∅;

   COMMENT apply depth first search to find the
           set of nodes of Gc which belong to G*(a);
   DFS(lastnode(a),0);

   COMMENT crash_set(a) can now be found by taking
           the union of all the S(v,a) for nodes v
           not in G*(a);
   FOR j := 1 STEP 1 UNTIL n DO
      IF nodeset[j] = false
         THEN crash_set(a) := crash_set(a) ∪ S(j,a);
END;
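
For readers who prefer an executable form, the same computation can be sketched in Python. This is an illustration rather than a transcription of CRASHSET: it uses an iterative depth first search, it assumes branches are triples (tail, head, set of versions), and the parameter start plays the role of lastnode(a). The function name crash_set and the sample graph are hypothetical.

```python
def crash_set(branches, start):
    """Return (node set of G*(a), crash set) for start = lastnode(a)."""
    succ = {}
    for tail, head, _ in branches:
        succ.setdefault(tail, []).append(head)

    reached, stack = {start}, [start]        # depth first search from lastnode(a)
    while stack:
        v = stack.pop()
        for w in succ.get(v, []):
            if w not in reached:
                reached.add(w)
                stack.append(w)

    result = set()
    for tail, head, versions in branches:    # branches entering G*(a) from outside
        if head in reached and tail not in reached:
            result |= versions
    return reached, result


branches = [(0, 1, {"a1"}), (0, 2, {"b1"}),
            (1, 3, {"a2"}), (2, 3, {"b2"}),
            (3, 4, {"a3"})]
nodes, crash = crash_set(branches, 3)
```

With node 3 as the starting node, the nodes reached are 3 and 4, and the branches entering them from outside carry the versions a2 and b2, which form the crash set. Each node and each branch is examined a bounded number of times, in line with the O(n,m) bound stated above.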

Definition (crash set of a set of objects): Let X = {x1, x2, ...,
xn} be a set of objects. The crash set of the set X, denoted
crash_set(X), is the latest possible recovery set which contains
an historical version of each object in X.

Before we state and prove the theorem which indicates how to
calculate the crash set of a set of objects let us generalize the
definition of the graph G*(a). Let X = {x1, x2, ..., xm} be a set
of objects. Let us define the graph G*(X) as the subgraph of Gc
induced by the nodes in the set V(X) defined as


     V(X) = { v | v is in G*(xi) for some xi ∈ X }.

THEOREM 4.9: Let X = {x1, x2, ..., xm} be a set of objects. Let
H = {h1, h2, ..., hk} be a set of branches of a condensed history
graph Gc defined as

     H = { v -> w | w ∈ G*(X) and v ∉ G*(X) }.

Then, crash_set(X) = H.

Proof: The proof of this theorem follows much the same line as
that of theorem 4.8. The set H is clearly a disconnecting set of
Gc since removal of the branches in H separates G*(X) from the
rest of the graph Gc. Since Gc and G*(X) are connected the set H
is the minimal set which disconnects Gc. We now have to prove that
the branches in H correspond to historical versions of distinct
objects. Assume that there are two branches hi and hj in H which
correspond to historical versions of the same object. Assume also,
without loss of generality, that hi precedes hj. Then there must
be a directed path P in Gc which traverses hi and hj in this
order. Let w be the node in G*(X) into which hi is incident. But w
is in G*(xi) for at least one i = 1, ..., m. Then, by definition
of G*(xi), there is a directed path from lastnode(xi) to w which,
concatenated with the portion of the path P starting at w, implies
that there is a directed path from lastnode(xi) which includes hj.
Hence, hj is in G*(xi) and therefore in G*(X). Therefore hj cannot
be in H. This shows that it is impossible for both hi and hj to be
in the set H.

So far, we have proved that H is a recovery set. Let us prove now
that H is the latest possible recovery set which contains an
historical version of each object in X. In other words, we want to
show that there is no other recovery set H1 which contains
historical versions of each object in X and that contains at least
one historical version which postdates the corresponding
historical version in H. Let hi and hj be historical versions of
the same object and let hi precede hj. Assume that hi ∈ H.
Consider now a cutset H1 of Gc such that the branches of H1 are
associated with historical versions of the same objects as the
branches of H. Let hj and not hi be in H1. Let z be the node into
which hi is incident. Let G1 and G2 be the two components of
G - H1. Let the node s be in G1.

Assume that a cutset such as H1 exists. Since hi ∈ H, node z is in
G*(X). Now, since node z is in G*(xi) for some i = 1, ..., m,
there is a directed path P1 in Gc from lastnode(xi) into node z.
But since lastnode(xi) and node z are in separate components of
G - H1, at least one branch of P1 must be in H1. This branch must
be incident out of a node of G2 and incident into a node of G1.
But this is not possible due to Lemma 4.5. Hence, a cutset such as
H1 cannot exist. This completes the proof of the theorem.


Given the result of theorem 4.9 we are now in a position to give
the algorithm to find crash_set(X). The algorithm consists of
finding the graph G*(X) and then finding the set of branches of
the condensed history graph, Gc, which are not in G*(X) and that
are incident into nodes of G*(X). The graph G*(X) can be
efficiently found by multiple applications of a depth first search
procedure, one application starting at lastnode(x) for each object
x ∈ X. Although the algorithm uses multiple applications of the
depth first search procedure, each node of the graph Gc will still
be visited at most once since the algorithm preserves the node
numbers assigned by each application of the DFS procedure. Also,
if G*(xi) is a subgraph of G*(xj) for xi ∈ X and xj ∈ X the
algorithm will not invoke DFS for lastnode(xi) if DFS was already
invoked for lastnode(xj). Before we describe the algorithm let us
define the set S(v,X) of all branches incident out of node v and
incident into any node of G*(X).


     S(v,X) = { v -> w | w ∈ G*(X) }


The algorithm which is described below in Algol-like notation is
O(n,m) since the DFS procedure is O(n,m).
PROCEDURE CRASHSET(X);
BEGIN
   COMMENT nodeset[j] = true if node j is in G*(X)
           last[j] = true if node j is lastnode
                     for any object
           processed[j] = true if node j is numbered and
                          last[j] = true;

   INTEGER i;
   BOOLEAN ARRAY nodeset[1:n], last[1:n], processed[1:n];

   PROCEDURE DFS(v,u);
   BEGIN
      number[v] := i := i + 1;
      nodeset[v] := true;
      COMMENT a numbered lastnode counts as processed;
      processed[v] := last[v];
      FOR EACH y -> v ∈ into(v)
         DO S(y,X) := S(y,X) ∪ {y -> v};
      FOR w in the adjacency list of v DO
         IF w is not numbered THEN DFS(w,v);
   END;

   COMMENT initialization phase;
   i := 0; crash_set(X) := ∅;
   FOR EACH node y in Gc DO S(y,X) := ∅;
   FOR j := 1 STEP 1 UNTIL n DO
      processed[j] := nodeset[j] := false;

   COMMENT apply depth first search to find G*(x)
           for each object x ∈ X;
   FOR EACH object x ∈ X DO
      IF processed[lastnode(x)] = false
         THEN DFS(lastnode(x),0);

   COMMENT crash_set(X) can now be obtained by taking
           the union of all S(y,X) for nodes y ∉ G*(X);
   FOR j := 1 STEP 1 UNTIL n DO
      IF nodeset[j] = false
         THEN crash_set(X) := crash_set(X) ∪ S(j,X);
END
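
A Python sketch of the multi-object case follows, under the same assumptions as before: branches are triples (tail, head, set of versions) and lastnodes is the list of nodes lastnode(x) for x ∈ X. The function name and sample graph are hypothetical. A search is skipped when its starting node was already reached by an earlier search, so every node is still visited at most once.

```python
def crash_set_of_objects(branches, lastnodes):
    """Return (node set of G*(X), crash_set(X)) for the given lastnodes."""
    succ = {}
    for tail, head, _ in branches:
        succ.setdefault(tail, []).append(head)

    reached = set()
    for start in lastnodes:
        if start in reached:                 # G*(x) already covered by an earlier search
            continue
        reached.add(start)
        stack = [start]
        while stack:
            v = stack.pop()
            for w in succ.get(v, []):
                if w not in reached:
                    reached.add(w)
                    stack.append(w)

    result = set()
    for tail, head, versions in branches:    # v -> w with w in G*(X), v not in G*(X)
        if head in reached and tail not in reached:
            result |= versions
    return reached, result


branches = [(0, 1, {"a1"}), (0, 2, {"b1"}),
            (1, 3, {"a2"}), (2, 3, {"b2"}),
            (3, 4, {"a3"}),
            (0, 5, {"c1"}), (5, 6, {"c2"})]
nodes, crash = crash_set_of_objects(branches, [3, 6])
```

For the starting nodes 3 and 6 the sketch reaches nodes 3, 4 and 6 and returns the versions a2, b2 and c2.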


4.8 - Crash Recovery of Very Large Databases

The standard technique used to deal with errors in databases
consists of periodically dumping the database and building a log
of modifications to the DB since the last dump took place, so that
when an error is detected the database can be reloaded from the
dump and the log processed against it. Dumping a database can be
done either with the database off line, i.e. by making it
inaccessible to the users, or dynamically, as suggested by
Rosenkrantz in [ROSE 78]. Reloading the entire database from a
dump, or even reloading only the portion of it which was modified
since the last dump took place, is unacceptable in very large
databases. As an example, consider the database which contains
data gathered by the U.S. Census. It is estimated that the size of
this database for the 1980 Census will be of the order of one
Terabyte (10**12 bytes) [SHOS 78]. Let us assume that an error in
the database was detected and let us also assume that, on the
average, ten percent of the database was modified since the last
dump was taken. With state-of-the-art technology, typical data
transfer rates for secondary storage devices are of the order of a
few Megabytes/sec. Therefore, it would take of the order of one
day to reload the portion of the database which was updated since
the last dump took place.
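
The estimate can be checked with a short calculation. The sketch below takes the 10**12 byte size and the ten percent figure from the text; the specific transfer rates of 1, 2 and 3 Megabytes/sec are assumptions chosen to stand for "a few Megabytes/sec", and the function name reload_hours is hypothetical.

```python
def reload_hours(db_bytes, fraction_modified, bytes_per_second):
    """Hours needed to reload the modified fraction of the database."""
    return db_bytes * fraction_modified / bytes_per_second / 3600


for rate_mb in (1, 2, 3):
    hours = reload_hours(10**12, 0.10, rate_mb * 10**6)
    print(rate_mb, "MB/sec:", round(hours, 1), "hours")
```

At 1 Megabyte/sec the reload takes about 28 hours, and at 3 Megabytes/sec about 9 hours, which is consistent with the figure of the order of one day.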


Alternatively, one can make use of the model developed in this
chapter in the following way. Partition the database into a number
of logical subdatabases. Each such subdatabase is an object in our
crash recovery model. An interaction edge is said to occur between
two objects if an update transaction involving the two
subdatabases they represent was issued by the user. Whenever one
or more subdatabases are detected to be in error we can calculate
the crash set as described in the previous section in order to
determine which portions of the database should be reloaded. This
approach assesses the extent of the damage and therefore allows us
to reload smaller amounts of data than with the previous
technique.


Note that the choice of the partitions will determine to which
degree one is able to reduce the portion of the database to be
reloaded. Theoretically, if one partitions the DB in such a way
that each subdatabase is the smallest updatable data unit, then
the crash set would give us exactly the portion of the database
that was affected by the error. Such a fine grain is not feasible
in practice and we will typically have subdatabases of much
coarser grain. Typical examples could be whole relations or
domains of a relation. At any rate, one should try to group in the
same subdatabase portions of the DB which tend to appear together
in update requests.


a. a  - 


The condensed history graph introduced in this chapter as a formal
model of crash recovery is important because of the following
aspects:


  i.   once the effect of the information flow between two objects
       has been taken into account in the graph, a record of it
       need not be stored anymore for crash recovery purposes.
       This fact is extremely important not only because it
       minimizes the storage requirements for the system but
       because the information flow pattern is crucial for correct
       crash recovery. Therefore if records of it needed to be
       stored, extremely reliable mechanisms should be provided to
       store them.

  ii.  all the snapshots or historical versions which are useless
       because they cannot participate in any crash set can be
       detected to be so. Therefore they can be discarded.

  iii. crash set calculation can be efficiently done.

CHAPTER  5 

CRASH  RECOVERY  IN  DISTRIBUTED  SYSTEMS 


5.1 - Introduction


In the previous chapter we showed how to efficiently calculate the
crash set for a set of objects given the condensed history graph
Gc. The same approach cannot be used in a distributed environment
since we cannot assume that there is a single complete version of
the graph Gc and, moreover, we cannot assume that the graph is
being updated continuously to reflect events such as generation of
new historical versions, occurrence of interaction edges and the
like. The existence of a single centralized version of the graph
is ruled out both by reliability considerations and by the fact
that keeping such a graph up to date would require that all purely
local interaction edges be sent to the site which kept the
complete version of the graph Gc. This approach implies a message
traffic overhead which is unacceptable. Instead of having a
centralized complete version of the graph Gc we are going to
assume that the global graph is partitioned into local subgraphs
distributed among the several sites of the network. In the
following section we will describe the criteria used in
partitioning the graph into subgraphs.
[...] the site where the process is currently running.


If the object x is a file then location(x) is the site where the
file is currently being accessed for update, or the site where it
was last accessed for update if the file is only being used for
read.


The following criterion is used to partition the global graph Gc
into subgraphs:


     ∀ object x, SG(x) is located at location(x).


As a consequence of the above criterion it follows that the
subgraphs associated with processes running at the same site are
all located at that site. Therefore all interaction edges between
them can be reflected locally. In addition we will require that,
in order for a file F to start being used for update by process P,
the subgraph SG(F) be brought over to location(P)*. As a
consequence all interaction edges between processes and files can
be reflected locally. Therefore, the only interaction edges which
cross site boundaries are those which represent interprocess
communication between remote processes.


* This is not a necessary requirement and all the solutions
presented in this chapter work correctly without it.


Given that the global condensed history graph Gc is partitioned
into local subgraphs Gloc at each site we need a protocol by which
updates to the graph are appropriately carried out.


5.3 - Updating the Local Subgraphs

We will describe in this section the algorithms used to update the
local subgraphs to reflect the following three operations.


     a. addition of a new historical version

     b. addition of an interaction edge

     c. intentional or accidental loss of an historical version.
Since we are assuming that the subgraph SG(x) is totally contained
in Gloc for the site location(x), the operation of adding a new
historical version is purely local to Gloc. This operation was
described in section 4.6.1 of the previous chapter.


Here we have to distinguish between two cases. In the first, the
interaction edge occurs between two nodes of the same subgraph
Gloc. In this case the operation is strictly local and is
described in section 4.6.2 of the previous chapter. We will
restrict in this case the Cycle_Elimination routine to the local
subgraph Gloc. While cycles which cross site boundaries may go
undetected, the cost of detecting them every time that new
potential cycles are generated may be fairly large. Alternatively
one could "glue" all the subgraphs together at a single site
periodically (e.g. once a day, once a month or once a year) and
detect all the inter-site cycles, thereby discarding the
additional useless historical versions.


The second case is the one in which the interaction edge occurs
between two objects located at different sites. In this case we
are not going to collapse the two nodes involved in the
interaction edge right away but we will batch these operations.
Collapsing two remote nodes requires at least one message in
addition to the one associated with the interaction edge itself.
Therefore collapsing nodes every time an interaction edge occurs
at least doubles the traffic in the network. The collapsing
operation will be carried out with the aid of an algorithm which
will execute periodically in the background without interfering
with the normal computation.


Let x and y be two objects located at different sites and let xi
and yj be the nodes associated with x and y between which the
interaction edge took place. We will denote an interaction edge as
a 3-tuple (xi,yj,t) where t is a timestamp which indicates the
time at which the interaction between x and y occurred. Note that
there is always a generator of an interaction edge. For instance,
if the interaction edge represents an IPC then the sender process
is the generator of the interaction edge. Similarly if a process
performs an operation on a file then the process is the generator
of the interaction edge. Note also that at site location(x) the
identity of the node associated with object y, namely yj, in the
interaction edge (xi,yj,t) is not known. Given these
considerations the algorithm used to add the interaction edge
(xi,yj,t) is simply the following.


S1 - Add to Gloc at location(x) an undirected branch t: xi <-> y.
     Notice that since yj is not known to the generator of the
     interaction edge the name of the object, y, is used to label
     one of the endpoints of the added branch. The node y will be
     called a non-local node to Gloc(location(x)).

S2 - Add to Gloc at location(y) an undirected branch t: x <-> yj.
     In general, the identity of node xi is not known at
     location(y). The node x is a non-local node to
     Gloc(location(y)).

The above operations are illustrated in figure 5.1. During the
collapse operation, to be described next, nodes xi and yj will be
collapsed into a super node labeled {xi,yj}. Note that the
timestamp t along with the pair (x,y) serves to indicate which
nodes, namely xi and yj, should be collapsed.
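
The bookkeeping of steps S1 and S2 can be sketched in Python as follows. The representation is an assumption made for the illustration: each site keeps a list of undirected branches stored as triples (timestamp, local node, name of the non-local object), and the function names are hypothetical. The second function shows how the timestamp together with the pair (x,y) identifies the two nodes to be collapsed.

```python
def add_remote_interaction(gloc_x, gloc_y, xi, x, yj, y, t):
    """Record the interaction edge (xi, yj, t) at both sites."""
    gloc_x.append((t, xi, y))    # S1: at location(x) only the object name y is known
    gloc_y.append((t, yj, x))    # S2: at location(y) only the object name x is known


def matching_pairs(gloc_x, gloc_y, x, y):
    """Pairs (xi, yj) whose branches carry the same timestamp for the pair (x, y)."""
    by_time = {t: local for t, local, remote in gloc_x if remote == y}
    return [(by_time[t], local)
            for t, local, remote in gloc_y
            if remote == x and t in by_time]


site_of_a, site_of_b = [], []
add_remote_interaction(site_of_a, site_of_b, "a1", "a", "b1", "b", 7)
pairs = matching_pairs(site_of_a, site_of_b, "a", "b")
```

In this example the single interaction at time 7 yields the pair (a1, b1), the two nodes that would later be collapsed into one super node.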

Since a fairly large number of interaction edges are typically
generated per unit of time we do not want to keep the super nodes
split apart. Therefore, periodically a protocol, called the
Collapsing Protocol, will be executed to collapse nodes.

5.3.2.1 - Collapsing Protocol

Before we describe the Collapsing Protocol we must introduce the
notions of sublog and global log and also describe the messages
involved in the protocol.

Every site, Si, keeps a list or sublog, SL(Si), of interaction
edges which occurred after the previous collapsing operation took
place. Let Tprevious be such a time. An entry in this sublog is of
the form:


     { (xi,y,t), into(xi), out(xi) }


where xi is a node in the subgraph at site Si, (xi,y,t) is an
interaction edge and into(xi) and out(xi) are as defined in the
previous chapter.


The collapsing operation consists of two parts. During the first
one, a global log, GL(T), of all the interaction edges which
occurred after Tprevious and no later than a certain time T is
constructed. The set of all the nodes that should be collapsed as
a result of the interaction edges represented in the global log is
then calculated and the relevant subgraphs are updated in a
subsequent part.


The algorithm used to calculate the set of collapsed nodes and
their sets of input and output branches is straightforward. The
input to this algorithm is a global log GL(T) given as a sequence
of interaction edges of the form (xi,y,t) as defined previously.
Also, the global log, which is the union of sublogs, contains the
into and out sets for all the nodes which appear as end nodes of
interaction edges in GL(T).


Even though a tuple such as (xi, y, t) does not specify
an interaction edge completely, the global log has the
property that if there is a tuple (xi, y, t) in GL(T) then
there must also be a tuple (x, yj, t) in GL(T). This
implies that the interaction edge occurred between nodes xi
and yj. These nodes should therefore be collapsed into a
single node.


The procedure to find the set of all the collapsed
nodes consists basically of first matching all the
interaction edges in GL(T). Then a super node is formed by
collapsing a maximal set of nodes connected by interaction
edges in the global log. The label of a super node is the
list of the names of all its component nodes. The set of
branches incident into a super node z is the union of the
sets of branches which are incident into the component nodes
of z from nodes other than these. Similarly, the set of
branches incident out of z is the union of the sets of
branches which are incident out of the component nodes of z
but which are not incident into these nodes. Self-loops and
parallel branches should be eliminated here.

The  following  example  illustrates  the  above  described 
procedure.  Consider  three  objects  a,  b  and  c  and  consider 
the  global  log  shown  below. 

interaction edges: (a1,b,t1), (a*,c,t3), (b1,a,t1),
                   (b1,c,t2), (c1,b,t2), (c*,a,t3).

adjacency sets:

into(a1) = {a0 -> a1};  out(a1) = {a1 -> a*}
into(a*) = {a1 -> a*};  out(a*) = ∅
into(b1) = {b0 -> b1};  out(b1) = {b1 -> b*}
into(c1) = {c0 -> c1};  out(c1) = {c1 -> c*}
into(c*) = {c1 -> c*};  out(c*) = ∅

Figure 5.2 shows the two super nodes obtained from the
global log above. Figure 5.2.a shows the construction of
the super nodes, figure 5.2.b shows the construction of the
into and out sets for each of them and finally figure 5.2.c
shows the two super nodes with their into and out sets and
with parallel branches already eliminated.
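
The matching and collapsing procedure described above can be
sketched in Python. This is only an illustrative sketch under
stated assumptions, not the implementation used in the system:
nodes are modelled as (object, version) pairs, the two halves of
an interaction edge are matched on the unordered pair of object
names plus the timestamp, and a union-find structure (our choice,
not named in the text) groups matched nodes. The function name
collapse and the data layout are hypothetical.

```python
from collections import defaultdict

def collapse(edges, into, out):
    """Return {super_node: (into_branches, out_branches)}.

    edges: iterable of (node, other_object, timestamp); a node is an
           (object, version) pair (an assumption of this sketch).
    into, out: dicts mapping each end node to a set of (tail, head) branches.
    """
    # Match the two halves of every interaction edge: (xi, y, t) pairs up
    # with the tuple logged for object y about object x at the same time t.
    halves = defaultdict(list)
    for node, other, t in edges:
        halves[(frozenset((node[0], other)), t)].append(node)

    parent = {}

    def find(n):
        parent.setdefault(n, n)
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for nodes in halves.values():
        for other in nodes[1:]:
            parent[find(other)] = find(nodes[0])

    # A super node is a maximal set of nodes connected by interaction edges.
    groups = defaultdict(set)
    for n in list(parent):
        groups[find(n)].add(n)
    supers = [frozenset(g) for g in groups.values()]
    owner = {n: s for s in supers for n in s}

    result = {}
    for s in supers:
        # Sets drop parallel branches; the membership tests drop self-loops.
        inc = {(owner.get(u, u), s) for n in s for (u, _) in into[n] if u not in s}
        outg = {(s, owner.get(v, v)) for n in s for (_, v) in out[n] if v not in s}
        result[s] = (inc, outg)
    return result

# The example global log from the text ("*" stands for the current version).
a0, a1, aS = ("a", 0), ("a", 1), ("a", "*")
b0, b1, bS = ("b", 0), ("b", 1), ("b", "*")
c0, c1, cS = ("c", 0), ("c", 1), ("c", "*")
edges = [(a1, "b", 1), (aS, "c", 3), (b1, "a", 1),
         (b1, "c", 2), (c1, "b", 2), (cS, "a", 3)]
into = {a1: {(a0, a1)}, aS: {(a1, aS)}, b1: {(b0, b1)},
        c1: {(c0, c1)}, cS: {(c1, cS)}}
out = {a1: {(a1, aS)}, aS: set(), b1: {(b1, bS)},
       c1: {(c1, cS)}, cS: set()}
result = collapse(edges, into, out)
```

For the example log this yields the two super nodes {a1, b1, c1}
and {a*, c*}; the three branches leaving the first super node
reduce to two once the parallel branches into the second super
node are merged.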

The collapsing protocol uses two types of messages,
called BUILD_LOG or BLOG and COLLAPSE_REQUEST or CREQ. The
BLOG message is used during the first part of the protocol
to collect all the entries in the global log GL(T). This
message has four parameters as defined below.

Si is the identification of the site which generated
the BLOG message,

T is the time instant up to which new interaction edges
are going to be considered for the global log to be
constructed (this time instant will be called the time
limit),

CL or Circulation List is the list of sites to which
the message must be sent. The BLOG message will visit
the sites in CL sequentially. This list will be updated
as the message passes through the several sites as
will be seen later.

PGL is a Partial Global Log. When all the sites in the
Circulation List have already been visited and the list
cannot be updated anymore then PGL is equal to the
desired global log GL(T).


The pair (Si,T) serves as a unique network-wide identifier
for the BLOG message.


The CREQ message is used in the second part of the
protocol to send the set of collapsed nodes which resulted from
GL(T) to all the sites which contributed to the construction
of the global log. This message has the four parameters
defined below.

Si is the identification of the site which generated
the CREQ message,

T and CL are the same as defined for the BLOG message,

SSN or Set of Super Nodes is the set of super nodes
obtained as a result of collapsing nodes according to the
interaction edges in GL(T).

At any site there may be several messages (of either
kind) at any given time. Messages are assigned one of the
following four statuses:

1. OAS: Outstanding and Already Sent

2. SAS: Saved and Already Sent

3. STBS: Saved To Be Sent

4. OPH1: Outstanding PHase 1


Each site Si maintains a queue, Q(Si), of saved
messages, i.e. messages which have either status SAS or STBS.
At each site Si there is at most one outstanding message,
denoted Mouts(Si), which has either status OAS or OPH1. The
assignment of statuses to messages will become clear as we
describe the protocol.
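
The message parameters, statuses and per-site bookkeeping above
can be pictured with a few Python types. This is a sketch for
illustration only; the class and field names (Message, Site,
payload, save) are hypothetical and do not come from the report.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Status(Enum):
    OAS = "Outstanding and Already Sent"
    SAS = "Saved and Already Sent"
    STBS = "Saved To Be Sent"
    OPH1 = "Outstanding PHase 1"

@dataclass
class Message:
    kind: str                     # "BLOG" or "CREQ"
    origin: str                   # Si, the generating site
    time_limit: int               # T
    cl: dict = field(default_factory=dict)       # site name -> visited flag
    payload: list = field(default_factory=list)  # PGL for BLOG, SSN for CREQ
    status: Optional[Status] = None

    @property
    def ident(self):
        # The pair (Si, T) identifies a BLOG message network-wide.
        return (self.origin, self.time_limit)

@dataclass
class Site:
    name: str
    queue: list = field(default_factory=list)    # Q(Si): saved messages
    outstanding: Optional[Message] = None        # Mouts(Si): at most one

    def save(self, msg, status):
        # Only the two "saved" statuses may sit in the queue.
        if status not in (Status.SAS, Status.STBS):
            raise ValueError("saved messages must have status SAS or STBS")
        msg.status = status
        self.queue.append(msg)

site = Site("S2")
blog = Message("BLOG", "S1", 10, cl={"S1": True, "S2": False})
site.save(blog, Status.STBS)
```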


The collapsing protocol is composed of two parts: a
global log construction part and a graph update part. The
global log construction part can be started by any site.
The starting site will select a time limit T and will
generate a BUILD_LOG message which will travel sequentially
through all the sites which contain the subgraphs associated
with objects which interacted since Tprevious up to time T.
During its course the BLOG message will gather the sublogs
accumulated at each relevant site in order to produce the
global log.


Since another site may have started the same process
with a possibly different time limit T' we have a race
condition which is resolved by giving priority to the message
with the smallest time limit. Notice that no synchronization
messages are necessary to deal with this race condition.


Consider two messages M1 and M2 with time limits T1 and
T2 respectively. Let T1 < T2. If M1 arrives at a given
site and finds that M2 is outstanding with status OAS then
M2 will be saved with status SAS and M1 will be allowed to
proceed, becoming outstanding at that site. If M2 arrives at
a given site and finds M1 as the outstanding message then M2
is saved with status STBS.


Once all the sites have contributed their share to the
global log, the last site to make a contribution will have
the complete log. It then determines how nodes should be
collapsed according to the interaction edges indicated by
the global log. The result of this computation is
essentially a set of updates that should be done in all relevant
subgraphs. These updates are carried out through the use of
a "nested two-phase commit" protocol [GRAY 78] since we want
to make sure that either all the subgraphs are updated or
none of them is. In other words, we want this update to be an
atomic operation. This atomic update is accomplished by the
CREQ message which will circulate through the set of sites
which participated in forming the global log.


When the subgraph at site Si is updated to reflect all
the interaction edges up to time T, the queue of saved
messages at this site is examined. If the queue is not empty
we select the message with the smallest time limit and make
it the outstanding message. If its status is SAS, its
status is then changed to OAS. If, on the other hand, its
status is STBS we must allow the message to continue on its
way by sending it to the next site to be visited. But before
the message continues one must delete from its partial
global log all the entries which have a timestamp not greater
than Tprevious, which has now been updated to the time T.
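
The queue handling just described can be sketched as a small
Python function. This is a hedged illustration, not the report's
code: messages are plain dicts, log entries are assumed to be
(node, object, timestamp) tuples, and the names resume_saved and
send are hypothetical.

```python
def resume_saved(queue, t_previous, send):
    """Pick the saved message with the smallest time limit and make it
    outstanding; returns that message, or None if the queue is empty."""
    if not queue:
        return None
    msg = min(queue, key=lambda m: m["T"])
    queue.remove(msg)
    if msg["status"] == "STBS":
        # Drop entries already covered by the update just applied
        # (timestamp not greater than Tprevious), then let it travel on.
        msg["pgl"] = [e for e in msg["pgl"] if e[2] > t_previous]
        send(msg)
    # A SAS message was sent earlier; either way it is now outstanding.
    msg["status"] = "OAS"
    return msg

queue = [
    {"T": 30, "status": "SAS", "pgl": []},
    {"T": 20, "status": "STBS", "pgl": [("a1", "b", 8), ("a2", "c", 15)]},
]
forwarded = []
outstanding = resume_saved(queue, t_previous=10, send=forwarded.append)
```

Here the message with time limit 20 is selected, its entry with
timestamp 8 is pruned, and it is forwarded before being marked
OAS; the message with time limit 30 stays in the queue.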


In  the  following  subsection  we  give  an  algorithmic 
description  of  the  protocol. 


i-l'Z.l  - 


The  collapsing  protocol  as  executed  at  site  Si  is 
described  by  the  rules  given  below.  It  is  assumed  that 
there  is  a  virtual  circular  order  of  the  sites  such  that 
when  a  message  in  this  protocol  has  to  be  sent  to  a  list  of 
sites,  it  is  sent  to  the  site  in  the  list  which  is  next  in 
the  circular  order. 




Rule 1: Site Si decides to start a collapsing operation.
There can be no outstanding messages at site Si. Let Tclock
be the value of the clock.

S1.1 - Select a time limit T > Tprevious (typically T
will be selected to be equal to Tclock).

S1.2 - Build a BLOG message M with the partial global log
PGL(M) initialized to the sublog SL(Si) and with the
circulation list CL(M) initialized to the list of sites
which contain the subgraph SG(x) if x is an object
involved in SL(Si). Mark site Si as "visited" in CL(M).

S1.3 - Send M to the next non-visited site in CL(M).

S1.4 - Make M the new outstanding message at site Si and
assign to it status equal to OAS (Outstanding and Already
Sent).


Rule 2: A BLOG message M = (Sj, T, CL, PGL) is received at
site Si.

S2.1 - If there is no outstanding message at Si then go
to step S2.4.

S2.2 - If the outstanding message at Si has a time limit
T' < T then mark site Si as "visited" in CL(M), save the
incoming message in the queue Q(Si) and assign to it the
status STBS (Saved To Be Sent). Stop.

S2.3 - [at this step there is an outstanding message with
status OAS and time limit ≥ T] Save the outstanding
message in Q(Si) and assign to it status SAS (Saved and
Already Sent). The incoming message M is now the outstanding
message.

S2.4 - [process incoming message] Add the sublog SL(Si)
to PGL(M). Mark Si as "visited" in CL(M). Add to CL(M)
the set of all the additional sites which are involved
because of interaction edges in the sublog SL(Si).

S2.5 - If there are non-visited sites in CL(M) then send
the message M to the next site in CL(M), make status(M) =
OAS, and stop.

S2.6 - [all the sites in CL(M) are visited: the partial
global log, PGL(M), is now the global log GL(T)] Find the
set, SSN, of all the super nodes derived from the global
log.

S2.7 - [the "graph update" phase can now be started]
Build a CREQ message M' = (Si, T, CL, SSN) where CL(M') =
CL(M) and the time limit T is the same as in the incoming
message M. Mark all the sites in CL(M') as "non-visited"
excluding site Si.

S2.8 - Send the CREQ message to the next site in CL(M').
The message M' is now the new outstanding message and is
assigned status = OPH1 (Outstanding PHase 1).
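
As an illustration of how a site might process an incoming BLOG
message under Rule 2, consider the following Python sketch. It
rests on several assumptions that are not in the text: messages
and sites are plain dicts, the "next site" in the virtual circular
order is approximated by dictionary order, the sites implied by
the local sublog are supplied ready-made in a hypothetical
extra_sites field, and send and find_super_nodes are caller
supplied callbacks.

```python
def on_blog(site, msg, send, find_super_nodes):
    """Handle a BLOG message arriving at `site` (Rule 2 in sketch form)."""
    name, out = site["name"], site["outstanding"]
    if out is not None:
        if out["T"] < msg["T"]:                    # S2.2: incoming must wait
            msg["cl"][name] = True
            msg["status"] = "STBS"
            site["queue"].append(msg)
            return
        out["status"] = "SAS"                      # S2.3: incoming preempts
        site["queue"].append(out)
    site["outstanding"] = msg
    msg["pgl"].extend(site["sublog"])              # S2.4: contribute sublog
    msg["cl"][name] = True
    for s in site["extra_sites"]:
        msg["cl"].setdefault(s, False)
    pending = [s for s, seen in msg["cl"].items() if not seen]
    if pending:                                    # S2.5: keep circulating
        msg["status"] = "OAS"
        send(pending[0], msg)
        return
    ssn = find_super_nodes(msg["pgl"])             # S2.6: log is complete
    cl = {s: (s == name) for s in msg["cl"]}       # S2.7: reset visit marks
    creq = {"kind": "CREQ", "origin": name, "T": msg["T"],
            "cl": cl, "ssn": ssn, "status": "OPH1"}
    site["outstanding"] = creq                     # S2.8: start graph update
    targets = [s for s in cl if not cl[s]]
    if targets:
        send(targets[0], creq)

sent = []
s2 = {"name": "S2", "outstanding": None, "queue": [],
      "sublog": [("b1", "a", 1)], "extra_sites": ["S1"]}
m = {"kind": "BLOG", "origin": "S1", "T": 10, "status": "OAS",
     "cl": {"S1": True, "S2": False}, "pgl": [("a1", "b", 1)]}
on_blog(s2, m, lambda dest, msg: sent.append((dest, msg)),
        lambda pgl: [frozenset(e[0] for e in pgl)])
```

In this run S2 is the last site in the circulation list, so it
completes the global log and sends a CREQ message to S1. The
stand-in passed as find_super_nodes simply groups all logged
nodes; a real computation would match interaction edges first.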

Rule 3: A CREQ message M = (S, T, CL, SSN) is received at
site Si.

S3.1 - If there is an outstanding message M' at site Si
which is not a CREQ message then make the incoming
message M the new outstanding message with status equal to
OPH1 and discard message M', else go to step S3.4.

S3.2 - Write intentions list for updates to the local
subgraph as indicated by SSN(M). Set Tprevious to T.

S3.3 - Mark Si as "visited" in CL(M) and send M to the
next non-visited site in CL(M) if there is any, otherwise
send it to site S. Stop.

S3.4 - [the outstanding message is a CREQ message with S
= Si: end of phase 1 of two-phase commit] If the
outstanding message M' is equal to the incoming message then
mark the sites in CL(M) as "non-visited" except for Si.

S3.5 - [the outstanding message is a CREQ message with S
≠ Si] Carry out local updates on local subgraph as
indicated by the set SSN in the message M. Mark Si as
"visited" in CL(M).

S3.6 - [adjust sublog to account for update in local
subgraph] Delete from sublog SL(Si) all the entries with a
timestamp ≤ Tprevious.

S3.7 - Forward M to the next non-visited site in CL(M) if
there is any.

S3.8 - Discard M and select the message M'' in Q(Si) with
the smallest time limit (if any) to be the outstanding
message.

S3.9 - If status(M'') = SAS then change its status to
OAS.

S3.10 - If status(M'') = STBS then delete from PGL(M'')
all the entries with timestamp ≤ Tprevious, forward M''
to the next non-visited site in CL(M'') and change
status(M'') to OAS.

Figure 5.3 illustrates what the subgraphs of figure 5.1
look like after the collapsing protocol has been executed.
All the nodes flagged with a star (*) are called non-local
nodes since they do not belong to the subgraph of any local
object.

5.-1-1-1  -  Collapsing  Protocol  -  An  Assertion 

Assertion: The collapsing protocol applies the updates
generated by global logs in increasing chronological order of
time limit.

Proof: We will say that a global log, GL(T), completes when
the BLOG message has already passed through all the relevant
sites. In order to prove this assertion it is only
necessary to prove that global logs complete in increasing
chronological order. Assume not. Then there are two global
logs GL(T1) and GL(T2) with T1 < T2 and such that GL(T2)
completed before GL(T1) did. Since T2 > T1, GL(T1) ⊆ GL(T2)
and therefore, all the sites that must be visited by the
BLOG message associated with GL(T1) must also be visited by
the BLOG message associated with GL(T2). But since priority
is given to messages of smaller time limit, BLOG(T1) could

Figure 5.3 - Example of the Application of the
Collapse Protocol to the Graph of Figure 5.1

criterion used to decompose the global graph Gc such end
nodes are both located in the same subgraph. Therefore,
this operation is strictly local and was described in sec-


5.-1 - Distributed Crash Set

One of the major results of the previous chapter is
that given a set of objects X, crash_set(X) is given by the
set of branches incident into nodes of the graph G*(X). The
graph G*(X) was found by using a depth first search (DFS)
graph traversal procedure. In our case, in which the global
graph is distributed, we have to find a distributed
algorithm to find the graph G*(X) and consequently crash_set(X).
This algorithm can be intuitively described as follows.

Let Si be a site where an error was detected. Let X be
the set of objects detected to be in error. The local
subgraph at Si, Gloc(Si), will be traversed in a way similar to
the one described in the previous chapter. The difference
now is that in each local subgraph there are two types of
nodes, namely local and non-local ones.

The  DFS  procedure  will  be  executed  at  site  Si  and  each 
time  that  a  non-local  node  is  visited  it  will  be  marked. 
These  marked  nodes  are  called  output  connections  and  may  be 
used  as  starting  nodes  or  roots  for  the  execution  of  a  DFS 
procedure  at  a  foreign  site. 

Once all the possible nodes in Gloc(Si) have been
visited we have a collection of output connections and a set
of sites where these output connections are local nodes. At
each of these sites, the DFS procedure is invoked for each
of the roots provided that they have not already been
visited in any of the previous invocations of the DFS procedure
at that site. Again, if non-local nodes are marked the
process repeats itself. If, however, only local nodes are
visited at a given site by the execution of the DFS
procedure then this site returns to its "calling site" the set
of nodes visited locally as well as those visited at sites
called by this site. By "calling site" we mean the one
which triggered the execution of the DFS procedure at a
given site. Eventually, the first site to start the crash
set calculation will receive from the sites which were
initially triggered by it sets of nodes the union of which is
the node set of G*(X).
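
The traversal just described can be simulated in a few lines of
Python. This is a simplified sketch, not the algorithm of
Appendix B: the recursion stands in for the messages exchanged
between sites, foreign sites are invoked one after another rather
than concurrently, and the names dfs_at_site, succ, home and
visited, as well as the small graph used, are hypothetical.

```python
def dfs_at_site(site, roots, succ, home, visited):
    """Run DFS at `site` from `roots`; return every node visited here and
    at the sites this call triggers (what is returned to the calling site).

    succ:    node -> list of nodes reached by following a branch
    home:    node -> site at which the node is local
    visited: site -> set of nodes already visited at that site
    """
    seen = visited.setdefault(site, set())
    found = set()
    outputs = {}                       # foreign site -> output connections
    stack = [r for r in roots if r not in seen]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        found.add(n)
        for m in succ.get(n, ()):
            if home[m] == site:
                stack.append(m)        # local node: keep traversing here
            else:
                outputs.setdefault(home[m], set()).add(m)  # mark non-local
    for other, conns in outputs.items():
        # Output connections become roots for a DFS at the foreign site.
        found |= dfs_at_site(other, conns, succ, home, visited)
    return found

# A small made-up graph spread over three sites (not the one of figure 5.4).
home = {1: "S1", 2: "S1", 3: "S2", 5: "S2", 4: "S3"}
succ = {1: [2], 2: [3], 3: [4], 4: [1], 5: [3]}
visited = {}
nodes = dfs_at_site("S1", [1], succ, home, visited)
```

Because every site remembers which of its nodes were already
visited, the request that comes back to S1 for node 1 has no
effect and the recursion terminates.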

The above algorithm will be best understood in the
light of an example. Consider the global graph Gc of figure
5.4. Graphic conventions were used to denote how Gc is
partitioned among three sites S1, S2 and S3. Nodes of Gc are
labeled with node numbers which are enclosed in brackets.
Let us assume that we want to find crash_set(X) where X
contains a single object associated with node 1 at site S1.
The progress of the algorithm can be graphically summarized
by a tree, called the Progress Summary Tree (PST), in which
each node is associated with a possible root for a DFS
procedure invocation. The sons of a node in the tree are the
output connections found when applying DFS starting at that
node. A node in this tree is labeled by the set of nodes
visited by the DFS procedure if it was used as a root.


In the example we are considering, DFS is applied at
site S1 starting at node 1. Three output connections are
found, namely node 11 at site S2 and nodes 5 and 9 at site
S3. The PST at this moment is shown in figure 5.5.a.

A message is sent to site S2 containing the root 11 and
another message is sent to site S3 containing the roots 5
and 9. At site S3, the DFS procedure is applied twice: once
starting at node 9 and once starting at node 5. Two
output connections are found, namely nodes 11 and 14 at site
S2. A message containing these nodes is sent to site S2.
Notice that node 11 had already been marked as an output
connection at site S1. Since nodes already visited by the
execution of the DFS procedure are not used as roots, node
11 will be used as a root only once. Let us assume in this
case that node 11 was used as a root in site S2 as a result
of the message from S1. An output connection to node 4 is
found and a message is sent to site S1. The PST up to this
point is given in figure 5.5.b.


Since nodes 11 and 14 were visited while executing the
DFS procedure at site S2 upon request of site S1, the
request of S3 to S2 has no effect and this completes all the
processing that could possibly be carried out at S3 or at
sites invoked by it. Therefore, S3 sends a message to S1
with the set of nodes visited at S3 and at the sites invoked
by S3. This set is {5, 6, 7, 9, 10}. At site S1, the
application of the DFS procedure yields no output connection.
Figure 5.5.c depicts the progress summary at this point.

Figure 5.5 - PSTs for the Application of the
Distributed Crash Set Calculation Algorithm to the
Graph of Figure 5.4.

Now, site S1 returns to site S2 (see figure 5.5.d) and
finally site S2 returns to site S1 (see figure 5.5.e). The
label of node 1 in figure 5.5.e is the set of nodes of
G*(X).


The algorithm just described exhibits a fairly large
degree of parallelism since it performs all possible local
computation and then triggers foreign computation at several
sites to exploit all the possible concurrency inherent to
the problem. The overall graph traversal procedure can be
viewed as a mixture of depth first search for intra-site
graph traversal and branch and bound for inter-site graph
traversal.

An  Algol-like  description  of  the  algorithm  is  given  in 
Appendix  B. 
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This  chapter  addressed  the  problem  of  finding  the  crash 
set of a set of objects in distributed environments. Since
in distributed systems we cannot assume that there is a
single and complete copy of the condensed history graph Gc,
a partitioning criterion is given whereby each site
maintains a subgraph of Gc. A protocol to update these
subgraphs when interaction edges are generated was given here.
Also, a distributed algorithm to find the crash set was
intuitively described in this chapter, while a complete
description of the algorithm can be found in Appendix B of
this dissertation. The protocol exploits all the possible
parallelism which exists due to the fact that the graph Gc
is partitioned and that there are several sites which may be
concurrently doing their part in the crash set calculation.
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CHAPTER  6 


THE  UCLA  DISTRIBUTED  SECURE  SYSTEM  BASE 


6.1 -


In this chapter we will show how the techniques and
models for crash recovery discussed in the two previous
chapters were used in the design of the UCLA Distributed
Secure System Base (DSSB). The UCLA DSSB, which is
described in detail in [MENA 78c], is intended as a base for
secure and reliable distributed computing. The belief is
that applications such as office automation systems,
distributed database management systems and the like can be
built more easily on top of a base such as the DSSB.


The system can be thought of as a collection of loosely
coupled homogeneous processors. Running at each processor
there is a slightly modified version of the UCLA Data Secure
Unix Security Kernel [KAMP 77], which is a portion of the
UCLA Data Secure Unix operating system. The kernel contains
all the mechanisms necessary to enforce authorized access to
the objects in the system (e.g., core frames, pages of
files, processes, message channels, etc.).


Reliable  computation  is  achieved  by  the  provision  of 
automatic  backup  of  processes  and/or  files  to  previously 
saved  snapshots,  when  errors  are  detected.  In  the  following 
sections  we  will  discuss  the  architectural  principles  used 


Crash recovery in a distributed system requires that
copies of objects be stored at more than one site in order
to allow for continued operation in the face of site and
communication link outages. The existence of multiple
copies of objects calls for a mechanism or protocol for the
management of those copies in a distributed environment.
This protocol should be implemented at the lowest possible
level within the Recovery Software so that most of it and
also all of the user software is able to refer to virtual
objects only, without any concern for the existence of
multiple physical realizations of them.

A  protocol  to  manage  multiple  copies  of  objects  should 
address  the  following  issues: 

A. Locking for synchronization of access. A set of
   rules through which processes can acquire and
   release control over an object must be specified.
   The basic alternative solutions to this problem
   are:

   a.1 Centralized Control. Access requests are sent
       to a centralized controller for the object
       which is responsible for making decisions as
       to whether access to the object should be
       granted or denied. This approach is clearly
       unacceptable for a distributed system since a
       crash of the centralized controller would
       bring the whole system to a halt. This problem
       is solved by the next alternative.

   a.2 Adaptive Centralized Control. In the face of
       a crash (actual or apparent) of the
       centralized controller a new one should take its
       place. Therefore we need a mechanism for the
       election of a new controller. Moreover, due
       to network partitioning one may end up having
       one controller per partition, which requires
       the existence of a mechanism to merge
       partitions upon network reconnection. The
       structure of such a protocol would almost certainly
       be too complex for the purpose of locking
       multiple copies of objects in a distributed
       system. An example of an adaptive centralized
       protocol for the more general case of locking
       in distributed databases was presented in
       section 3.2 of chapter 3.

   a.3 Distributed Control. The set of sites which
       store a copy of the object in question should
       agree globally before any one of them can be
       given control over the object. This approach

       of the system in the face of partitions, the
       majority protocol could be applied on a per
       object basis.

B. Mutual Consistency. A set of copies of an object
   is said to be mutually consistent if all of them
   converge to the same value after all activity
   ceases. Alternatively, one could guarantee that
   all of the sites know the name and location of the
   most current copy of the object. Therefore, each
   site should be able to detect the fact that it has
   an outdated copy of the object and request of the
   appropriate site that the current version be made
   available.

The protocol to be utilized in our architecture has
distributed control, uses a weighted majority rule for
continued operation in the face of network partitioning and
handles mutual consistency by guaranteeing that every site
is able to know the name and location of the most recent
copy of a given object.
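
The weighted majority rule is not spelled out at this point in
the text. As a rough illustration, assuming the usual reading in
which each site holding a copy carries a weight and a partition
may keep operating on an object only if it holds strictly more
than half of the total weight, the test could look like the
following Python sketch. The function name may_operate and the
weights are hypothetical.

```python
def may_operate(weights, reachable):
    """weights: site -> weight assigned for one object (assumed given).
    reachable: set of sites in the current partition.
    Returns True if the partition holds a strict weighted majority."""
    total = sum(weights.values())
    held = sum(w for site, w in weights.items() if site in reachable)
    return 2 * held > total

weights = {"S1": 2, "S2": 1, "S3": 1}
left = may_operate(weights, {"S1", "S2"})    # holds 3 of 4
right = may_operate(weights, {"S3"})         # holds 1 of 4
tie = may_operate(weights, {"S2", "S3"})     # holds exactly half
```

Requiring a strict majority means that two disjoint partitions
can never both be allowed to operate on the same object.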

Zl-2-1 - Error Confinement

Systems designed with reliability in mind should
provide for fire walls against widespread damages. In other
words, errors when they occur should be confined to the
least possible number of objects. Randell introduced, in
[RAND 75], a device called a conversation, which is a
generalization of the notion of a transaction in database
systems [ESWA 76, LAMP 76]. Any set of objects which want to
interact should agree to enter into a conversation. Objects
within a conversation may interact freely, but may not
interact with any other objects outside it. If a failure is
detected in any object of a conversation, the whole
conversation would be backed up to a recovery point taken at the
entry point of the conversation. In this sense, a
conversation is an atomic action. This implies that the outcome of
the computation carried out by processes inside a
conversation cannot be seen by processes outside it, otherwise the
conversation would not be atomic anymore. In other words,
there cannot exist any information flow across conversation
boundaries.

There are clearly two advantages in having
conversations, namely:

1. they provide a mechanism for error confinement - a
   necessary requirement for any reliable system
   architecture.

2. they provide a mechanism for error recovery - i.e.
   they are the units of recovery.
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Since  one  of  the  requirements  of  the  DSSB  is  security, 
information flow can only occur between two objects if the
protection policy allows the flow to occur. An information
flow model like Denning's lattice model [DENN 76] or a model
based on colors and profiles [POPE 77] is necessary to
establish secure information flow if processes cannot be
trusted. Notice that the protection mechanism establishes
the necessary fire walls to avoid widespread damages caused
by errors. If the color model is used one can change the
user's profile to limit his access rights temporarily. This
would also limit the set of objects that would potentially
be affected by a failure. It is important to note also that
even if security were not an issue we would still need a
mechanism to enforce the property that information flow does
not cross conversation boundaries.


Note that requiring that processes outside a
conversation do not see the result of the computation of a
conversation until it completes successfully is likely to cause a
serious performance degradation both during normal operation
- through decreased concurrency - and during crash recovery -
due to the necessity of complete backup of possibly long
computations. In short, for medium and long duration
conversations it is not efficient to commit the results of a
computation only after it has successfully completed. This
is an acceptable solution for the case of database
management systems where the transactions are short (most of them
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The  crash  recovery  strategy  implemented  in  the  distri¬ 
buted  system  is  based  in  backward  error  recovery.  This  im¬ 
plies  that  results  are  committed  when  generated  and  if  er¬ 
rors  are  detected  one  has  to  find  out  the  set  of  processes 
which  used  these  results,  back  them  up  along  with  the 
results  they  produced  and  so  on  and  so  forth.  The  net 
result  of  using  this  strategy  is  that  we  end  up  having  a 
finer  grain  of  recoverability. 

Since the issue of error confinement is taken care of by
implementing secure information flow mechanisms and since we
do not want to have a conversation as an atomic action
because of efficiency considerations, it is not necessary to
implement conversations in the DSSB.


that the calculation of a crash set generates a globally
inconsistent state, since one or more interactions between
objects could have been ignored.

On the other hand, the only problem associated with the
loss of one or more historical versions for an object is a
possible performance degradation since more backing up may
be necessary to recover from a crash. Therefore, historical
versions of objects will be stored and retrieved as ordinary
files supported by the File Policy Manager (FPM), to be
described later in this chapter.

Note that interaction edges do not need to be recorded
as such once their effect in the global condensed graph has
already been taken into account. This fact, along with the
fact that historical versions themselves may be lost without
compromising correct error recovery, shows that the portion
of the recovery data which is critical for correct error
recovery is contained in the condensed history graph. A
mechanism for reliably storing and updating this graph,
using a stable storage notion [LAMP 76], is implemented at each
site.


6.2.5 - Separation of Security from Recovery Relevant Mappings


The complete mapping which indicates for each object
which sites have a copy of the object and what is the actual
physical location of the object within each site is
decomposed into security relevant local mappings and
non-security relevant mappings to other sites. The former is
a local mapping which indicates the actual physical location
of each local object only. The latter indicates to each site
which other sites have a copy of the object, giving no
indication of the actual physical location at those sites.
Having each site know about the actual physical mappings at
each other site would involve a considerable performance
overhead for updating this information.

Figure 6.1 illustrates the decomposition mentioned
above. In particular, figure 6.1.a shows the complete
mapping for an object x, for which there are copies at sites
S1, S2, ..., Sk. Figure 6.1.b shows the decomposed mapping.
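The decomposition can be pictured with a small sketch. The
names below (SiteMappings, local_location, copy_sites,
decompose) and the numeric disk addresses are illustrative
assumptions, not part of the DSSB design; the sketch only
shows that each site ends up knowing the physical location of
its own copy and merely the names of the other copy sites.

```python
from dataclasses import dataclass, field


@dataclass
class SiteMappings:
    """Per-site view of object locations (hypothetical structure)."""
    site: str
    # Security relevant: physical location of each LOCAL object only.
    local_location: dict = field(default_factory=dict)
    # Not security relevant: names of the other sites holding a copy,
    # with no physical addresses.
    copy_sites: dict = field(default_factory=dict)


def decompose(complete):
    """Split {object: {site: physical_location}} into per-site mappings."""
    sites = {}
    for obj, placement in complete.items():
        for site, location in placement.items():
            m = sites.setdefault(site, SiteMappings(site))
            m.local_location[obj] = location
            m.copy_sites[obj] = set(placement) - {site}
    return sites


# Complete mapping for one object x with copies at three sites.
complete_map = {"x": {"S1": 0x40, "S2": 0x9C, "S3": 0x10}}
per_site = decompose(complete_map)
```

With this layout, moving x to a different disk location at S2
changes only the local mapping at S2; no other site has to be
updated.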

6.2.6 - Separation of Mechanisms from Policy

One would like to be able to verify the properties of
consistency and correct recovery for the Recovery Software,
and for this purpose the architecture of the DSSB minimizes
the dependency of the recovery process on the rest of the
software. As an application of this principle, this design
entirely separates recovery policies from recovery
mechanisms. Recovery policy issues such as frequency of
snapshots, number of copies to be kept of any object, and
the like can be dealt with in as sophisticated a manner as
the user desires. These policies are implemented by modules
which do not belong to the Recovery Manager, the part of
the recovery software responsible for carrying out a correct
crash recovery operation. The correct operation of modules
that supply data used by the Recovery Manager is similarly
not affected by policy software.

Figure 6.1 - Complete Object-Location Mapping
(fig. 6.1.a) and Decomposed Mapping
(fig. 6.1.b).

6.3 - Major System Modules

The  Distributed  Secure  System  Base  is  composed  of  four 
major  modules: 

a.  the  Base  Kernel  (BK). 

b.  the  File  Policy  Manager  (FPM). 

c.  the  Multiple  Copy  Manager  (MCM). 

d.  the  Recovery  Manager  (RM). 

The Base Kernel is a slightly modified version of the
UCLA Data Secure Unix Kernel [KAMP 77]. It can be viewed as
providing an abstract machine composed of several types of
objects (processes, units of storage called segments,
devices, and message channels) and operations valid on these
objects, which are subject to user settable protection con-


The File Policy Manager (FPM) implements security policy
controls with the use of a secure information flow model.
In addition, the FPM, along with the Multiple Copy Manager
and the Base Kernel, implements a distributed file system.
Interaction edges which involve files (i.e. openings and
closings of files) are sensed by the FPM and reported to the
Recovery Manager.

The  Multiple  Copy  Manager  (MCM)  basically  implements 
the  majority  consensus  protocol,  mentioned  before,  used  to 
coordinate  access  to  the  multiple  existing  copies  of  files. 
The  Recovery  Manager  (RM)  essentially  implements  the  crash 
recovery  protocols  introduced  in  Chapter  5. 

Figure  6.2  illustrates  the  layered  architecture  of  the 
DSSB.  The  understanding  of  the  layering  is  that  a  module  at 
any  given  layer  only  needs  the  operations  provided  by  the 
layers  at  inner  levels  for  it  to  work  correctly.  Each  of 
the  component  modules  of  the  DSSB  will  be  discussed  in  the 
remainder  of  this  chapter. 

6.3.1 - Base Kernel (BK)

We  will  only  discuss  here  the  changes  to  the  Base  Ker¬ 
nel  which  were  introduced  to  support  the  distributed  system. 


These  modifications  fall  into  three  categories: 


[Figure 6.2 - Layered architecture of the DSSB. Recoverable
layer labels: Multiple Copy Manager, File Policy Manager,
Recovery Manager, User Code.]


available before the desired page is requested from the
foreign Base Kernel. Remote kernels refer to the page by
its virtual name, which is net wide unique and site
independent.

When a kernel receives a SHIP IN request from a foreign
kernel it checks first whether the requested virtual page is
core resident. If this is the case, the kernel starts
transferring the page to the requesting site. If not, the
SHIP IN request is passed to the FPM, which finds the disk
location of the page and issues a SHIP OUT call.

Upon receipt of the SHIP OUT call, the kernel causes
the page to be brought to core and transfers the page to the
destination (requesting) site.


6.3.1.2 -


The architecture of the Base Kernel is capability
based. In order for backward error recovery to take place
conveniently, it is necessary that processes be able to
migrate and still refer to the same objects in a site
independent manner. Therefore, the name field in the
capability has to be a virtual name. Other fields in the
capability are:

a. local disk location: for pages only.

b. core location guess pointer: for pages only - this
field is a guess as to where in core the page can
be found.

c. access rights.


6.3.1.3 - Network Wide Inter Process Communication

The Base Kernel, owing to its functions and its position
at the lowest level in the system hierarchy, must be aware
of the existence of the network and must know how to
communicate with objects at remote sites by using the
network explicitly. As a consequence, the set of Base
Kernels is in a good position to make the network
transparent as far as processes are concerned. In other
words, the set of BKs implements a network wide inter
process communication mechanism which, along with a network
independent naming scheme for processes, makes inter process
communication the same, be it local or remote.

6.3.2 - Multiple Copy Manager (MCM)

The MCM is primarily responsible for implementing a
protocol which can be used to coordinate access to multiple
copies of files. An intuitive description of the majority
consensus protocol is given in the remainder of this
subsection.


The multiple copy management protocol implemented by
the MCM integrates all the aspects of the problem discussed
in section 6.2.2 of this chapter. Control is distributed in
the sense that a majority of the sites which keep a copy of
a given file has to be involved in the process. We will
assume throughout this discussion that the set of sites that
maintain a copy of a given file is static. More will be
said about this later. We will also assume the existence at
each site of a process called a Multiple Copy Manager (MCM)
which actually implements this protocol. This protocol is
homogeneous in the sense that it is the same for all nodes.
The approach taken here to mutual consistency is that of
making sure that every MCM is able to know which site has
control over a file, what its status is, and what the name
of its most current copy is.

Each site keeps a data structure called a status
descriptor (SD) which indicates the status of the system as
viewed by that site. The status indicates whether the file
is locked or not and, if so, in which mode it is locked, and
also the name of its most current copy. An SD is a 4-tuple
of the form <ts, fn, mode, tout> where,

1. ts is a timestamp whose use will become clear as
the protocol is described.

2. fn is the file name, which is a two part name of
the form <global name>.<generation#>.

3. mode is the mode in which the file is locked. It
can assume one of the following values: unlocked (U),
share (SH) and exclusive (X).

4. tout is an estimate of the time interval during
which the file will be in use. When mode = U then
tout is set to -1.

A status descriptor with mode = U will be called an
unlocked status descriptor, as opposed to a locked SD for
the case in which mode ≠ U.

For the purposes of this explanation it is necessary to
consider only one file. Let it be called F and let S = {S1,
S2, ..., Sn} be the set of sites that keep a copy of F.
Assume that the initial state of the system is such that all
SDs have the value <t0, F.0, U, -1>, indicating that a copy
of F with generation number equal to zero was created at all
sites at time t0 and that no one has control over F.
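As a concrete illustration, the status descriptor and its
initial value can be written down as a small record. This is
a minimal sketch: the class name, the helper initial_sd and
the use of integers for timestamps are assumptions made for
the example, and only the three mode values listed above are
modelled.

```python
from typing import NamedTuple

UNLOCKED, SHARE, EXCLUSIVE = "U", "SH", "X"


class StatusDescriptor(NamedTuple):
    ts: int     # timestamp of the last status change
    fn: str     # "<global name>.<generation#>"
    mode: str   # one of UNLOCKED, SHARE, EXCLUSIVE
    tout: int   # estimated time in use; -1 when unlocked

    def is_locked(self):
        return self.mode != UNLOCKED


def initial_sd(global_name, t0):
    """SD held by every site right after generation 0 is created."""
    return StatusDescriptor(ts=t0, fn=f"{global_name}.0",
                            mode=UNLOCKED, tout=-1)


sd0 = initial_sd("F", t0=0)
```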

Let S1 be a site that wants to lock file F. In order to
determine the status of the file it requests from all sites
in S that their current SD for F be sent to S1. After S1
has collected SDs from at least a majority of the sites in S
(including itself) it selects the most current status
descriptor, i.e., the one with the biggest timestamp. As
will be proved later, the status of a file can be correctly
determined from the most current SD among a set of SDs
collected from a majority of sites. Let SDmax be this most
current status descriptor.

If SDmax is an unlocked status descriptor or if it is
locked in a non-conflicting mode, then S1 is allowed to lock
F. It does so by successfully broadcasting to at least a
majority of the sites an appropriately timestamped SD which
indicates that S1 is the locking site. An estimate of the
time interval during which S1 will need F is included in the
timeout component of this SD.

If SDmax is locked in a conflicting mode then S1 has to
wait and retry later. The timeout component of SDmax can be
used as a guess as to when S1 should retry.
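The decision taken by the requesting site can be sketched as
follows. This is only an illustration under stated
assumptions: SDs are plain (ts, fn, mode, tout) tuples, two
share locks are taken to be the only non-conflicting
combination, and "appropriately timestamped" is modelled by
choosing a timestamp larger than that of SDmax. The function
names are hypothetical.

```python
def majority(n_sites):
    """Smallest number of sites that constitutes a majority."""
    return n_sites // 2 + 1


def conflicts(held_mode, requested_mode):
    # Assumption: only share/share is compatible.
    if held_mode == "U":
        return False
    return not (held_mode == "SH" and requested_mode == "SH")


def try_lock(replies, n_sites, requested_mode, now, tout):
    """Decide on a lock request from the SDs collected so far.

    replies: list of (ts, fn, mode, tout) tuples, one per replying site.
    Returns None if no majority replied, ("retry", delay_guess) on a
    conflict, or ("lock", new_sd) with the SD to be broadcast.
    """
    if len(replies) < majority(n_sites):
        return None
    ts, fn, mode, held_tout = max(replies, key=lambda sd: sd[0])  # SDmax
    if conflicts(mode, requested_mode):
        return ("retry", held_tout)
    return ("lock", (max(now, ts + 1), fn, requested_mode, tout))
```

The broadcast of the new SD to a majority, which actually
establishes the lock, is not shown.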

A site releases a lock on the file F by reliably
broadcasting to at least a majority of the sites a release
request. A protocol similar to the Assured Communication
Protocol (see section 3.2.2 of chapter 3) is used to provide
the needed robustness to the release control protocol.

The period during which F will be used by S1 can be
renewed by sending an appropriately timestamped renewal
request and having it approved by at least a majority of the
sites. If S1 fails to get a majority for its renewal
request it must release control over the file. If the file
was locked in exclusive mode by S1, then S1 must back up the
updates it made to F. When a site times out for its current
SD on F, it will discard it and replace it by the previous
SD. This mechanism allows locks on files to be released
from a set of sites which do not constitute a majority,
thereby allowing the majority to use the file.

Only one site should be able to lock a file in
exclusive mode. In order to enforce this property it is
sufficient that an MCM send an unlocked status descriptor in
reply to a lock request to one and only one MCM. If an
unlocked SD has already been sent to another site when a
request for locking the file in exclusive mode is received,
a message is sent to the requesting site that someone is
already trying to lock the file in exclusive mode. But now
we have a potential for deadlocks, since several sites may
each collect less than a majority of unlocked SDs. None of
them would then be granted exclusive access to the file and
they would not be able to proceed. The solution to this
problem is to have each site retry after a randomly chosen
delay.

As  we  asserted  before,  the  examination  of  the  most 
current  status  descriptor,  i.e.,  the  one  with  the  biggest 
timestamp,  determines  correctly  the  status  of  a  file.  Let 
us  now  prove  this  assertion. 


ASSERTION 6.1: Let S = {S1, S2, ..., Sn} be a set of sites at
which there are status descriptors for a file F. The SD
with the biggest timestamp selected from a subset X of S
containing a majority of the sites in S reflects the current
status of the file.

Proof: In the MCM protocol described above, every time that
the status of a file F is changed (i.e. the file is locked
or unlocked), a subset of S containing at least a majority
of sites has to be notified. Let Y be the subset of S which
was involved in the operation which resulted in the last
change to the status of the file. Therefore, Y contains at
least a majority of the sites in S and the following
inequality holds:

$|Y| \geq \lfloor |S|/2 \rfloor + 1$    (6.1)

where $\lfloor x \rfloor$ is a function which gives the
greatest integer less than or equal to x. The SDs in all
sites in Y are the same and they reflect the current status
of the file F. Since the set X also contains a majority of
the sites in S the following inequality is true:

$|X| \geq \lfloor |S|/2 \rfloor + 1$.    (6.2)

While X and Y may be different, in general, they must
have at least one element in common. Assume not. Then, it
must be the case that

$|X| + |Y| \leq |S|$    (6.3)

But if we sum (6.1) with (6.2) we get

$|X| + |Y| \geq 2 \lfloor |S|/2 \rfloor + 2 \geq |S| - 1 + 2 > |S|$    (6.4)

which contradicts (6.3) and proves that X and Y are not
disjoint. The assertion is then proved since the SDs in Y
have the biggest timestamp and therefore reflect the current
status of the file F and since $X \cap Y \neq \emptyset$.
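The key step of the proof, that two majorities of the same
set of sites always share a member, can also be checked
exhaustively for small n. The sketch below is not a
substitute for the proof; it enumerates every pair of
majority subsets for 1 to 7 sites. The function names are
chosen for this example only.

```python
from itertools import combinations


def majorities(sites):
    """Yield every subset of sites with at least floor(n/2) + 1 members."""
    sites = list(sites)
    need = len(sites) // 2 + 1
    for k in range(need, len(sites) + 1):
        for subset in combinations(sites, k):
            yield set(subset)


def all_majorities_intersect(n):
    """True if every pair of majority subsets of n sites has a common site."""
    ms = list(majorities(range(n)))
    return all(x & y for x in ms for y in ms)


results = {n: all_majorities_intersect(n) for n in range(1, 8)}
```

Note that the "+ 1" in (6.1) and (6.2) matters: with four
sites, the two halves {0, 1} and {2, 3} are disjoint, so
exactly half of the sites is not enough.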


Some  observations  about  the  above  described  protocol 
are  in  order.  First,  it  is  not  necessary  for  all  the  sites 
where  a  copy  of  the  file  exists  to  be  involved  in  the  deci¬ 
sion  procedure.  In  particular,  one  can  designate  a  subset 
of  the  set  of  sites  which  have  a  copy  of  the  file  -  let  us 
call  them  voting  sites  -  as  being  those  which  are  involved 
in  the  protocol  and  with  respect  to  which  a  majority  is  con¬ 
sidered.  This  approach  cuts  down  on  the  communications  cost 
incurred  by  the  protocol. 


A second observation is that the majority rule solves
the problem of network partitioning by allowing the
partition with a majority of voting sites for a given file
to continue using the file. This rule may be made more
flexible by assigning weights to each voting site and by
considering a weighted majority. As an example, one may
want to be able to continue to access a local copy of a file
even if one's site becomes isolated from the rest of the
network because of line failures. This could be
accomplished by assigning a high enough weight to one's copy
so that one's site itself would be enough to achieve a
weighted majority. Many other policies can be implemented
by a proper weight assignment to the voting sites of a given
file.
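A weighted majority test is easy to state in code. The
following sketch assumes integer weights and a strict
majority of the total weight; the site names and weights are
invented for the example.

```python
def has_weighted_majority(reachable, weights):
    """True if the reachable voting sites hold more than half the weight.

    reachable: set of site names in the current partition.
    weights:   dict mapping every voting site to its weight.
    """
    total = sum(weights.values())
    held = sum(weights[s] for s in reachable)
    return 2 * held > total


# S1 is given enough weight to keep using its local copy when isolated.
weights = {"S1": 4, "S2": 1, "S3": 1, "S4": 1}
isolated_ok = has_weighted_majority({"S1"}, weights)
others_ok = has_weighted_majority({"S2", "S3", "S4"}, weights)
```

With these weights S1 alone holds 4 of 7 votes, so the
partition containing S2, S3 and S4 cannot also obtain a
majority. Setting every weight to one gives back the plain
majority rule.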

6.3.3 - File Policy Manager (FPM)

The  File  Policy  Manager  (FPM)  is  primarily  responsible 
for  implementing  a  distributed  file  system  and  also  for  im¬ 
plementing  a  secure  information  flow  model.  The  system 
software  above  the  FPM  as  well  as  all  of  the  user  software 
should be able to refer to files via virtual file names,
which are netwide unique and location independent.

A virtual file, or simply a file, can be considered as a
collection of virtual pages (the data pages) plus some
mappings which define the file. One of these mappings
specifies the set of virtual pages which compose a given
file. This mapping is called the File to Virtual Page
Mapping or F -> VP mapping. Before a file can be accessed
at a given site, the F -> VP mapping for the file must be
made available at the site. This operation is functionally
equivalent to the operation of opening a file in centralized
file systems. The MCM is used to lock the file in the
desired mode. Once this is done, the FPM is able to request
that the F -> VP mapping be brought from a remote site. At
this point the file is said to be known at the site. It may
be the case that none of the pages of the file are present
locally when the user starts using it.

For those pages which are locally present, and only for
them, the FPM keeps a mapping which indicates the physical
disk location associated with the virtual page. This mapping
is called the Virtual Page to Physical Page mapping or VP
-> PP mapping.

When a page which is not present locally is referenced
by a user, the page must be brought over from a foreign
site. This is accomplished via the Network Wide Page
Faulting (NWPF) mechanism.

Network wide page faulting is implemented by the set of
FPMs with the aid of the set of Base Kernels. When a user
faults for a page which is not local, the FPM issues a SHIP
IN call to the local Base Kernel requesting it to bring the
desired VP from a site in the File to Site Mapping kept by
the MCM. If the page cannot be found at the specified site,
the FPM must choose another one and retry the operation
until either the page is found or the list of sites is
exhausted. In the latter case, the page is declared to be
lost and the file is declared to be in error. This error is
reported to the Recovery Manager, which will start a crash
recovery operation.
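The retry behaviour of network wide page faulting can be
sketched as a simple loop. Everything named here is
hypothetical: ship_in stands in for the SHIP IN call to the
local Base Kernel and is assumed to return the page contents
or None, and PageLost stands in for the error report sent to
the Recovery Manager.

```python
class PageLost(Exception):
    """No site in the File to Site Mapping could supply the page."""


def fetch_page(vp, candidate_sites, ship_in):
    """Try each candidate site in turn until the virtual page is found.

    vp:              virtual page name (net wide unique).
    candidate_sites: sites taken from the File to Site Mapping.
    ship_in:         callable (site, vp) -> page contents or None.
    """
    for site in candidate_sites:
        page = ship_in(site, vp)
        if page is not None:
            return site, page
    # List of sites exhausted: the page is lost and the file is in error.
    raise PageLost(vp)


# Toy stand-in for remote sites: S2 has lost the page, S3 still has it.
stores = {"S2": {}, "S3": {"vp7": b"data"}}


def demo_ship_in(site, vp):
    return stores.get(site, {}).get(vp)


found_at, contents = fetch_page("vp7", ["S2", "S3"], demo_ship_in)
```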


Figure 6.3 shows the possible interactions between the
FPM, MCM and BK. The picture also shows the operations
associated with each such interaction.

In  addition  to  the  above  described  functions,  the  FPM 
must  perform  some  operations  which  are  required  by  the 
Recovery  Manager.  These  functions  include  the  sensing  of 
interaction  edges  which  involve  files  and  the  reporting  of 
these  edges  to  the  Recovery  Manager,  as  well  as  the  taking 
of  historical  versions  of  objects  and  the  restoration  of  ob¬ 
jects  to  previously  generated  historical  versions  when 
demanded by the Recovery Manager. These issues will be
covered in the following section, where the internals of the
Recovery Manager are discussed.

6.3.4 - Recovery Manager

This system module is responsible for implementing the
error recovery facilities needed to support reliable
distributed computing. These facilities are based on the
crash recovery model developed in chapter 4. Consequently,
the core of the Recovery Manager is devoted to the
maintenance of a local subgraph of the global condensed
history graph Gc and to carrying out the crash set
calculation algorithms developed in chapter 5. In order to
perform the above functions, the Recovery Manager needs to
interact with the FPM and with the BK, as well as with other
RMs. These interactions are in the form of messages, the
description of which is given in what follows.

(1) interaction with other MCMs via networkwide
IPC facility.

(2) locking of files.

(3) issue of SHIP IN and SHIP OUT calls;
interaction with other FPMs via networkwide
IPC facility.

Figure 6.3 - Possible Interactions Between
the FPM, MCM and the BK.


Whenever  a  recovery  policy  module  -  which  is  not  part 
of  the  DSSB  -  dictates  that  an  historical  version  of  an  ob¬ 
ject  should  be  taken,  the  RM  is  charged  with  the  task  of  in¬ 
itiating  this  action.  In  particular,  if  an  historical  ver¬ 
sion  of  a  file  must  be  generated,  then  the  following  message 
is  sent  from  the  RM  to  the  FPM. 


message name: take_file_hv

message arguments: (filename, timestamp)

message description: the Recovery Manager requests that the
FPM take an historical version of the file which has virtual
name filename. The historical version will be a file itself
and the name of this file is filename.timestamp where
timestamp is the clock value. Additionally, the RM updates
its local subgraph to reflect the generation of the new
historical version in the way described in section 5.3.1 of
chapter 5.

Two basic alternatives for taking historical versions
of processes can be considered. The first one consists
in having a distinguished process charged with the task of
taking historical versions of all the processes. This
process should be trusted, otherwise security could be
compromised. Alternatively, each process could take an
historical version of itself. This is the approach taken in
the DSSB. The following message is sent from the RM to the
process in question.


message name: take_proc_hv

message arguments: (processname, timestamp)

message description: as with the previous message, the RM
requests that the process called processname generate an
historical version of itself and store it in a file called
processname.timestamp where timestamp is as before. As in
the previous message, the condensed history graph is updated
to reflect the new historical version.

Interaction edges can be detected at two points in the
DSSB: in the FPM (for file operations) and in the BK (for
inter process communication). These interaction edges have
to be reported to the RM for the purpose of updating the
condensed graph. For efficiency reasons, these reports may
contain a batch of interaction edges instead of a single
one. For the case of inter process communication between
remote processes, each of the BKs involved will sense it and
report the interaction edge to its local Recovery Manager.

The kind of file operations that we are considering for
the purposes of sensing and recording interaction edges are
open and close operations. Let us examine how these
interaction edges are applied to the appropriate local
subgraphs. Let P be a process requesting an operation on file
F.  As  we  mentioned  in  chapter  5,  the  subgraph  SG(P)  associ¬ 
ated  with  process  P  is  located  at  the  site  where  P  is  run¬ 
ning.  The  file  operation  requested  by  P  is  sensed  by  the 
local  FPM  which  passes  it  to  the  local  RM.  The  Recovery 
Manager  then  adds  the  interaction  edge  to  SG(P)  as  described 
in  section  5.3.2  of  the  previous  chapter.  We  also  require 
that  the  FPM  local  to  the  site  where  SG(F)  is  currently  lo¬ 
cated  be  involved  in  the  operation.  The  most  natural  thing 
to  do  is  to  acquire  the  F  ->  VP  mapping  from  location(F) 
-  the  site  where  SG(F)  is  currently  located.  Now,  by  the 
assumption  made  in  chapter  5,  SG(F)  is  located  at  the  site 
where  F  is  currently  being  accessed  for  update  or  where  it 
was  last  used  for  update.  Therefore,  the  F  ->  VP  mapping  at 
location(F)  is  the  most  current  copy  of  it.  The  FPM  at 
location(F)  senses  the  operation  and  passes  the  correspond¬ 
ing  interaction  edge  to  its  local  RM. 


In any event, the RM will receive reports of interaction
edges from its local BK and FPM. These reports come in
the message described below.

message name: ie_report

message arguments: (report, message_seq_#)

message description: this message contains a batch of
interaction edges. This batch, called report, has a
message_seq_# assigned to it. This number is used to
identify the acknowledgement message which must be sent by
the RM to the sender of the ie_report.

The ie_report message must be acknowledged to its
sender so that it can discard the report. This is done with
the message:

message name: ie_report_ack

message argument: (message_seq_#)

message description: upon receipt of this message, the
interaction edge report associated with message_seq_# is
discarded.
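The reporting side of this exchange (the FPM or the BK) can
be sketched as below. The class EdgeReporter, its method
names and the send callback are assumptions made for the
illustration; only the message names ie_report and
ie_report_ack and their arguments come from the text.
Retransmission of unacknowledged reports is left out.

```python
class EdgeReporter:
    """Batches interaction edges and keeps each report until acknowledged."""

    def __init__(self, send):
        self._send = send        # callable(message_name, arguments)
        self._next_seq = 0
        self._batch = []         # edges sensed since the last report
        self._pending = {}       # message_seq_# -> report awaiting ack

    def sense(self, edge):
        self._batch.append(edge)

    def flush(self):
        """Send the current batch as one ie_report; return its sequence number."""
        if not self._batch:
            return None
        seq = self._next_seq
        self._next_seq += 1
        self._pending[seq] = self._batch
        self._batch = []
        self._send("ie_report", (self._pending[seq], seq))
        return seq

    def on_ack(self, seq):
        """Handle ie_report_ack: the report may now be discarded."""
        self._pending.pop(seq, None)

    def unacknowledged(self):
        return dict(self._pending)
```

Keeping the report until the acknowledgement arrives means an
interaction edge is not lost if the message to the RM is.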

Another operation on the graph is the discarding of
historical versions. As mentioned earlier, historical
versions of files and processes are files like any other
file as far as the FPM is concerned. When the RM decides
that a given historical version must be discarded, it
applies the appropriate collapsing operation in the graph
(as described in section 5.3.3 of chapter 5) and requests
the FPM to delete the file which contains the historical
version. This is done via the following message from the RM
to the FPM.


message name: [illegible]


message  argument :  (hv_name) 

message  description :  upon  receipt  of  this  message, the  FPM 
will  delete  the  file  hv_name  which  contains  an  historical 
version  of  an  object. 

When  a  crash  is  detected  the  RM  at  the  site  where  the 
error  is  detected  will  start  the  crash  set  calculation 
described in the previous chapter. The RMs at foreign sites
will cooperate, when required by the protocol, in performing
the crash set calculation.

Once  the  crash  set  calculation  is  over,  the  objects  in 
the  crash  set  must  be  restored  to  the  historical  versions 
indicated  in  the  crash  set.  The  restoration  itself  is  done 
by  the  FPM  upon  request  of  the  RM.  The  request  has  the  form 
of  the  following  message. 

message name: restore_object

message arguments: (object_type, hv_name, message_seq_#)

message description: upon receipt of this message the FPM
restores the object of the type specified by object_type
(i.e. file or process) to its historical version in the file
hv_name. The message_seq_# is used to associate this
message with a reply to it containing the outcome of the
restoration operation.


The Recovery Manager is notified of the result of the
restoration operation via the following message.

message name: restoration_outcome

message arguments: (outcome, message_seq_#)

message description: the outcome of the object restoration
operation associated with message_seq_# is sent back to the
RM. This outcome may be either successful, or unsuccessful,
in which case the specified historical version was not
found.
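The request and reply pair can be illustrated from the FPM's
side. This is a sketch under several assumptions: historical
versions and live objects are held in plain dictionaries, the
object name is recovered from hv_name by stripping the
trailing ".timestamp" (following the naming used for
take_proc_hv), and the function name is invented. Only the
message names and arguments come from the text.

```python
def handle_restore_object(hv_store, live_objects,
                          object_type, hv_name, message_seq):
    """Process restore_object and build the restoration_outcome reply.

    hv_store:     dict hv_name -> saved state (historical version files).
    live_objects: dict (object_type, object_name) -> current state.
    """
    if hv_name not in hv_store:
        # The specified historical version was not found.
        return ("restoration_outcome", ("unsuccessful", message_seq))
    # Assumed naming: "<object name>.<timestamp>".
    object_name = hv_name.rsplit(".", 1)[0]
    live_objects[(object_type, object_name)] = hv_store[hv_name]
    return ("restoration_outcome", ("successful", message_seq))
```

The message_seq_# is simply echoed, which lets the RM match
each outcome with the request that caused it.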

Notice  that  in  most  of  the  operations  just  described, 
the  RM  interacts  with  its  local  FPM  irrespective  of  the  lo¬ 
cation  of  the  desired  file.  This  is  possible  because  the 
FPMs  implement  a  distributed  file  system  which  makes  the  lo¬ 
cation  of  a  file  transparent  to  software  above  the  FPM. 

6.4 - Conclusions

Large  and  important  classes  of  applications  of  distri¬ 
buted  computing  require  the  existence  of  a  secure  and  reli¬ 
able  base.  Office  automation  systems  and  database  manage¬ 
ment systems are examples of these classes of systems. The
UCLA Distributed Secure System Base (DSSB) described in this
chapter provides the required base for such applications.

Six important architectural principles were singled out
here, namely: network independent object names, the
existence of multiple copies of objects, error confinement,
minimization of the data critical for correct error
recovery, separation of security from recovery relevant
mappings and separation of mechanisms from policy issues.

The actual system architecture, which is composed of
four major modules, was described in this chapter. The
architecture is layered in such a way that one level only
needs the operations provided by the levels below for its
correct operation. The set of Base Kernels implements a
distributed process name space along with a network wide
inter process communication mechanism. At the next level,
we find the Multiple Copy Managers, which implement the
locking protocol necessary to coordinate access to multiple
copies of files. At the next level, the File Policy
Managers implement a network wide file name space and a
distributed file system. The FPMs do not have to be
concerned with the existence of multiple physical
realizations of files since this issue is taken care of by
the MCMs. Finally, at the outermost level we find the
Recovery Manager, which is in charge of carrying out crash
recovery operations.


CHAPTER  7 


CONCLUSIONS AND

SUGGESTIONS  FOR  FUTURE  RESEARCH 

7.1 - Introduction

In  this  dissertation  we  made  contributions  in  two 
areas,  namely  the  areas  of  distributed  systems  and  that  of 
distributed  databases.  Besides  summarizing  these  contribu¬ 
tions,  this  chapter  presents  some  suggestions  for  future 
research.

7.2 - Contributions in the Area of Distributed Systems

The design of distributed systems involves important
issues such as:

(a) interconnection structures

(b) error recovery

(c) network operating systems

(d) security

(e) program and data assignment

In this dissertation we made contributions to the problems
of error recovery and of the design of network operating
systems.

7.2.1 - Error Recovery

A  formal  model  for  crash  recovery  in  computer  systems 
was  developed  here.  The  model,  which  is  in  the  form  of  a 
graph called the condensed history graph, considers the
existence of a set of snapshots or historical versions for
each  recoverable  object  in  the  system  as  well  as  the  records 
of  flow  of  information  between  these  objects. 

This  graph  is  defined  in  such  a  way  that  the  following 
properties  hold: 

a. any cutset of the graph containing branches associated
with historical versions of distinct objects is a
recovery set, i.e. a set of historical versions such
that a global consistent state of the system is
obtained if state restoration is done according to the
historical versions indicated by the recovery set.

b. the latest possible recovery set, called a crash
set, was characterized in terms of a specially defined
cutset of this graph. This cutset can be found using
an efficient algorithm which is linear in space and
in time.


c.  all  the  historical  versions  associated  with 
branches  in  a  cycle  of  the  condensed  history  graph 
are useless because they cannot participate in any
crash  set.  Therefore  they  can  be  discarded. 

d.  records  of  information  flow  between  objects  do  not 
need  to  be  stored  anymore  for  crash  recovery  pur¬ 
poses  once  their  effect  has  been  taken  into  ac¬ 
count  in  the  graph.  This  fact  is  extremely  impor¬ 
tant  not  only  because  it  minimizes  the  storage  re¬ 
quirements  for  the  system  but  because  the  informa¬ 
tion  flow  pattern  is  crucial  for  correct  error 
recovery.  Therefore,  if  records  of  information 
flow  needed  to  be  stored,  extremely  reliable 
mechanisms would be needed for this purpose.

In  a  distributed  system  we  cannot  assume  anymore  the 
existence  of  a  single  complete  version  of  the  condensed  his¬ 
tory  graph.  Moreover,  we  cannot  assume  that  the  graph  is 
being  updated  continuously.  Instead,  the  graph  is  assumed 
to  be  partitioned  into  local  subgraphs  distributed  among  the 
several  sites  of  the  network.  A  protocol  to  update  these 
local subgraphs as well as a protocol to do crash set
calculation in a distributed manner were developed in this
dissertation. The latter exploits all the possible parallelism
which exists due to the fact that the graph is partitioned
and that there are several sites which may be concurrently
doing their part in the crash set calculation.

7.2.2 - User Interface

The  user  interface  to  a  distributed  system  should  be 
one  in  which  the  distributed  nature  of  the  system  is  not  ap¬ 
parent.  Examples  of  such  interfaces  are  operating  systems, 
database  management  systems  and  office  automation  systems. 

At any rate, the task of designing and building such
interfaces is greatly simplified if they can be built on top of
a distributed system base which makes the network
transparent.

This  dissertation  contributed  to  the  architecture  of 
the  UCLA  Distributed  Secure  System  Base  (DSSB).  The  DSSB  is 
a secure and reliable system base which utilizes the
recovery protocols mentioned above. The DSSB is essentially
composed  of  four  modules,  namely: 

1. The BASE KERNEL which implements the security control
mechanisms and also implements a network wide
interprocess communication facility.

2. The MULTIPLE COPY MANAGER which is responsible for
coordinating access to the several existing copies
of the same file.

3. The FILE POLICY MANAGER which implements a distributed
file system.

4. The RECOVERY MANAGER which implements the crash
recovery facilities of the system.

The  following  architectural  principles  were  identified 
in the design of the DSSB:

a.  Network  Independence 

b.  Multiple  Copies 

c.  Error  Confinement 

d.  Minimization  of  Recovery  Relevant  Data 

e.  Separation  of  Security  Relevant  from  Recovery 
Relevant  Mappings 

f.  Separation  of  Mechanisms  from  Policy 

7.3 - Contributions in the Area of Distributed Databases

Major issues in the design of distributed databases
are data integrity, crash recovery, concurrency control and
synchronization protocols. This dissertation presented a
locking protocol to coordinate access to a distributed
database and to maintain database consistency. The protocol
is robust in the face of additional failures. Recovery is
done in such a way that maximum forward progress is
achieved by the crash recovery procedures. The protocol
does not degrade normal operation; in particular,
the delay and communications cost incurred by an update
request are not a function of the size of the network.

An undesirable side effect of locking is the problem of
deadlocks. This dissertation argued that deadlock detection
is superior to the other methods for handling deadlocks in a
distributed database environment. Deadlock detection requires
the ability to detect cycles in a "state graph". In this
dissertation we presented two algorithms to detect deadlocks
in a distributed database: one is hierarchically organized
and the other is distributed. Neither of them requires that
the whole graph be built at one site in order for deadlocks
to be detected. These protocols are far superior to
centralized deadlock detection approaches when the
communications cost is a function of the distance.
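
As a point of reference for the cycle detection requirement, the following is a minimal single-site sketch in Python. It is not the hierarchical or the distributed algorithm of this dissertation; it only shows what detecting a cycle in a state graph amounts to. The dictionary representation of the graph and the names `find_cycle` and `state_graph` are assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Optional


def find_cycle(state_graph: Dict[Hashable, List[Hashable]]) -> Optional[List[Hashable]]:
    """Return the nodes of one cycle, or None if the graph has no cycle.

    state_graph maps each node (e.g. a transaction) to the nodes it waits for.
    This is a hypothetical, centralized representation.
    """
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on the current path, finished
    colour: Dict[Hashable, int] = {}
    path: List[Hashable] = []         # nodes on the current depth first path

    def visit(node: Hashable) -> Optional[List[Hashable]]:
        colour[node] = GREY
        path.append(node)
        for nxt in state_graph.get(node, []):
            c = colour.get(nxt, WHITE)
            if c == GREY:             # back edge: nxt is on the current path
                return path[path.index(nxt):]
            if c == WHITE:
                found = visit(nxt)
                if found is not None:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for start in list(state_graph):
        if colour.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle is not None:
                return cycle
    return None
```

A deadlock among three transactions that wait for one another in a ring shows up as a three node cycle; an acyclic graph yields `None`.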

7.4 - Suggestions for Future Research

The design of distributed systems is an area of
research in its early stages. Some of the problems are
understood, some are solved and some still have to be solved.
In this dissertation we covered some of the problems and
developed formal models to aid in solving them. While we
treated the problems of crash recovery using backward error
recovery strategies, we did not address the problem of error
detection. This problem is extremely important and no
general solution to it has been found so far. An approach to
error detection at the programming language level was
presented in [ANDE 76]. This method suggests that more than
one version of an algorithm be independently programmed.
Each version is called an alternate block. The set of
alternate blocks for a given algorithm is called a recovery
block. Recovery blocks can be nested. At the end of each
alternate block there is an acceptance test. If the test
succeeds then the execution of the associated alternate
block is considered to be successful. If however the test
fails an error is detected. In this case, the system is
backed up to its state at the beginning of the alternate
block in question. The next alternate block for the
recovery block is then attempted, until either one of them
succeeds or all of them have been tried. The error
detection problem is very important and deserves further
research.
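
The control flow of a recovery block can be sketched as follows. This is a hedged Python illustration, not the notation of [ANDE 76]: the names `run_recovery_block` and `RecoveryBlockFailed` are hypothetical, the state is assumed to be a dictionary that can be copied, a single acceptance test is shared by all alternates, and an exception raised by an alternate is treated like a failed test.

```python
import copy
from typing import Callable, Dict, List


class RecoveryBlockFailed(Exception):
    """Raised when every alternate block has failed the acceptance test."""


def run_recovery_block(state: Dict,
                       alternates: List[Callable[[Dict], None]],
                       acceptance_test: Callable[[Dict], bool]) -> None:
    """Try each alternate in turn; restore the state after each failure."""
    for alternate in alternates:
        # Snapshot of the state at the beginning of this alternate block.
        checkpoint = copy.deepcopy(state)
        try:
            alternate(state)
            passed = acceptance_test(state)
        except Exception:
            passed = False            # assumption: a crash counts as a failure
        if passed:
            return
        # Error detected: back the state up to the snapshot.
        state.clear()
        state.update(checkpoint)
    raise RecoveryBlockFailed("all alternate blocks failed the acceptance test")
```

For instance, with a faulty first alternate and a correct second one for sorting a list, the second alternate starts from the restored, unsorted input and the block as a whole succeeds.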

As  for  further  research  in  the  area  of  distributed  da¬ 
tabases  we  believe  that  suitable  models  of  synchronization 
protocols  need  to  be  developed.  These  models  should  aid  in 
the task of formally verifying the properties that one
requires of these protocols. Formal verification clearly
requires a precise representation of a protocol where its
properties can be formally stated. Graph models of parallel
programs  are  a  natural  choice  for  protocol  specification. 
Some  of  these  models,  like  Petri  Nets,  limit  themselves  to 
modeling  the  flow  of  control  of  a  computation  and  therefore, 
may  strongly  impair  one's  capability  of  using  them  as  an  aid 
in  proving  properties,  like  consistency,  which  deal  with 
data semantics.

Other models, like Keller's model [KELL 76], extend
Petri Nets by attaching to each transition an action on a set
of variables which is executed if the transition would fire
according to Petri Net rules and if a predicate attached to
the transition is true. A modified version of this model
has been used by Bochmann [BOCH 77] in the specification and
formal verification of communications protocols.


Ellis suggested in [ELLI 77] a different approach in
which a production system, called L-system, is used to
obtain a representation of a protocol. This representation
includes flow of control as well as the semantics of a
parallel computation in a uniform manner. The application of
the above mentioned methods to the modelling of synchronization
protocols in distributed databases is a subject worthy of
further investigation.


Another area for further research is the area of
performance analysis of synchronization protocols.
Models of such protocols should take into account the
transaction parameters such as interarrival time distributions
at each node, distribution of the number of data items
required at each of them, amount of processing time required
and the like. With the use of queueing models and/or
simulation models one should be able to obtain results such as
average response time and average communications cost as a
function of the input parameters. Such an analysis would be
an important tool for the database designer since it would
help him understand the tradeoffs involved in the use of
various synchronization protocols. While some work has been
done in the area (see [GARC 78] and [GELE 78]) much more needs
to be done.


APPENDIX A


PROOFS  FOR  ASSERTIONS  IN  CHAPTER  3 


PROOF FOR ASSERTION 3.1: Let k be the number of relevant
LLCs in the component and let S = (X; x1, x2, ..., xk) be a
global feasible state. The validity of this assertion can
be readily verified by examination of Table 1. If a lock is
in the LOCK table of the LC the state X is equal to (1,0,0)
if there are no pending lock release requests associated
with the lock in question. But for X = (1,0,0), the only
possible LLC states are (0,1,0) and (1,0,0). Therefore the
lock is present at every site and the assertion is proved.


PROOF FOR ASSERTION 3.2: Analogously to the proof of
assertion 3.1, this assertion is easily proved by examination of
Table 1. The only possible LLC states associated with the
LC state X = (0,0,0) are (1,0,1) and (0,0,0). Therefore the
release request in question is present at every site and the
assertion is proved.

PROOF FOR ASSERTION 3.3: We need to show that at the end of
the notification phase, every site will end up with the same
value for the trial list T and consequently the same value
for NEWLC. Before doing so, we must prove the following two
assertions.


Assertion A.1: Given two trial sequences, one of them must
be a prefix of the other. In other words, let X = x[1],
x[2], ..., x[p] and Y = y[1], y[2], ..., y[p], y[p+1], ...,
y[q]; then x[i] = y[i] for i = 1,...,p.


Proof: This assertion follows directly from the fact that
the nomination is done following the nomination order and
starting at the node that follows the crashed LC in this
order.


Assertion A.2: The sequence T[1], T[2], ..., T[m] of trial
sequence values taken by the trial list T is such that if
T[i] = i[1], i[2], ..., i[p], then T[i+1] = i[1], i[2], ...,
i[p], i[p+1], ..., i[p+q] for i = 1, ..., m-1. In other
words, T[i] and T[i+1] have the same prefix, namely i[1],
..., i[p].


Proof: By assertion A.1, either T[i] is a prefix of T[i+1]
or vice versa. If T[i] is empty this assertion is trivially
true. So, from this point on assume that T[i] is not
empty. Assume now that this assertion is not valid.
Therefore, it must be the case that if T[i+1] = j[1], ...,
j[p] then T[i] = j[1], ..., j[p], j[p+1], ..., j[p+q] for q >
0. But, as T[i] is not empty, in order for T to be set to
the value of T[i+1], it must be the case that, during
execution of the algorithm in section 3.2.3.1, j[p+1] was not in
T[i]. This contradicts our assumption and thus proves
assertion A.2.

We can now prove assertion 3.3. Assume that this
assertion is not true. So, there are at least two sites i
and j whose T values are different at the end of the
notification phase. Let T[i] = X, i[1], ..., i[p] and T[j]
= X where X is a common prefix. If T[i] has the value X,
i[1], ..., i[p] and the notification phase has already
ended, then an NA message with trial sequence equal to T[i] must
have passed through node j. By that time, the algorithm in
section 3.2.3.1 would have changed the value of T[j] to that
of T[i]. This contradicts our assumption and shows that all
T and consequently NEWLC values will end up being the same
at the end of the notification phase. That the value for
NEWLC is the identification number of the latest LC to be
nominated follows directly from assertion A.2.
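
The prefix argument can be made concrete with a small Python sketch. The update rule of section 3.2.3.1 is not reproduced in this appendix, so the rule used here, that a site adopts the longer of its own trial list and the one carried by an incoming message, is an assumption that merely agrees with the proof above. The names `is_prefix` and `merge_trial_list` are hypothetical.

```python
from typing import List, Sequence


def is_prefix(a: Sequence[int], b: Sequence[int]) -> bool:
    """True if sequence a is a prefix of sequence b."""
    return len(a) <= len(b) and list(b[:len(a)]) == list(a)


def merge_trial_list(local: Sequence[int], received: Sequence[int]) -> List[int]:
    """Return the trial list a site holds after seeing a received sequence.

    By assertion A.1 one of the two sequences is a prefix of the other,
    so the longer one extends the shorter one (assumed update rule).
    """
    if is_prefix(local, received):
        return list(received)
    if is_prefix(received, local):
        return list(local)
    raise ValueError("trial sequences violate the prefix property (A.1)")
```

Under this rule, sites holding `[]`, `[4]` and `[4, 5]` all end up with `[4, 5, 7]` once a message carrying `[4, 5, 7]` has passed through them, which is the agreement claimed by assertion 3.3.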


PROOF FOR ASSERTION 3.4: In order to prove this assertion we
must consider all the possible global states in which the
set of LLCs can be left when a crash occurs. Table 1
gives us the set of possible LLC states for each LC state.
Therefore, when a crash occurs, the resulting global state
is one of the possible combinations of LLC states associated
with the state of the crashed LC just before the crash.

The list of all the possible combinations was derived
from Table 1 and the eleven resulting cases are listed
below. Each case is represented by a set S of LLC states
where the set of global states associated with S is such
that there is at least one LLC state associated with each
element of S.


CASE 1:  (0,0,0)
CASE 2:  (1,0,0)
CASE 3:  (0,1,0)
CASE 4:  (1,0,1)
CASE 5:  (0,0,0; 0,1,0)
CASE 6:  (0,1,0; 1,0,0)
CASE 7:  (0,0,0; 1,0,1)
CASE 8:  (1,0,0; 1,0,1)
CASE 9:  (0,1,0; 1,0,1)
CASE 10: (0,0,0; 0,1,0; 1,0,1)
CASE 11: (0,1,0; 1,0,0; 1,0,1)

The actions taken in each of the above cases by the
algorithm that builds the sets L and R are summarized in
Table A.1.

    CASES           ACTION
    ---------------------------------------------
    1 & 2           no action
    3, 5 & 6        add lock to the set L
    4, 7 & 8        add lock to the set R
    9, 10 & 11      If seq#(0,1,0) > seq#(1,0,1)
                    Then add lock to the set L
                    Else add lock to the set R

Table A.1. Actions taken by the algorithm
that builds the sets L and R.
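
The eleven cases collapse to a rule on two of the LLC states, (0,1,0) and (1,0,1), which the following Python sketch makes explicit. It is an illustration of Table A.1 only, not the algorithm that builds L and R: the function name `table_a1_action`, the constants, and the way sequence numbers are passed in as a dictionary keyed by LLC state are all assumptions.

```python
from typing import Dict, Optional, Set, Tuple

LLCState = Tuple[int, int, int]

STATE_010: LLCState = (0, 1, 0)   # hypothetical constant names
STATE_101: LLCState = (1, 0, 1)


def table_a1_action(states: Set[LLCState],
                    seq: Dict[LLCState, int]) -> Optional[str]:
    """Return "L", "R" or None (no action) for a set S of LLC states."""
    has_010 = STATE_010 in states
    has_101 = STATE_101 in states
    if has_010 and has_101:                      # cases 9, 10 and 11
        return "L" if seq[STATE_010] > seq[STATE_101] else "R"
    if has_010:                                  # cases 3, 5 and 6
        return "L"
    if has_101:                                  # cases 4, 7 and 8
        return "R"
    return None                                  # cases 1 and 2
```

Checking the function against every row of the case list reproduces the actions of Table A.1; only cases 9, 10 and 11 need the sequence numbers.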


Assertion 3.4 states that a lock p is either in all
the LOCK tables of S(p) or it is in none of them. Let us
prove this statement as a lemma.

Lemma: At the end of the LCR mechanism the following is
true. If the lock p is in one LOCK table of a site in S(p)
then it must be in the LOCK table of all of them.


Proof:  Assume  that  it  is  not  in  at  least  one  LOCK  table  but 
that  it  is  in  some  of  them.  Then,  this  lock  was  not  includ¬ 
ed  in  the  set  L  otherwise  it  would  have  been  added  to  all 
the  LOCK  tables.  In  a  similar  vein,  it  was  not  included  in 
the  set  R  otherwise  it  would  have  been  deleted  from  all  the 
LOCK tables. Therefore, the lock was in neither an L-list nor
an R-list when the LCR mechanism started. This eliminates cases
3  through  11.  We  are  then  left  with  cases  1  and  2.  Case  1 
is  also  eliminated  because  by  assumption  the  lock  is  in  at 
least  one  LOCK  table.  Case  2  is  also  eliminated  because  by 
assumption  the  lock  is  not  in  at  least  one  LOCK  table.  All 
the  cases  have  been  eliminated.  This  is  a  contradiction  and 
proves  the  lemma. 


The complete statement of assertion 3.4 now follows
directly from the algorithm which builds the sets L and R.
As indicated in Table A.1, the actions taken in each case
are such that maximum forward progress is achieved in each
case. In other words, locks that would have been granted
are added to all the LOCK tables and locks that would have
been released are deleted from all the LOCK tables. Thus,
assertion 3.4 is proved.


APPENDIX B


ALGORITHM  FOR  DISTRIBUTED  CRASH  SET  CALCULATION 

An  Algol-like  description  of  the  algorithm  to  do  crash 
set calculation in a distributed system is given in this
appendix.

Procedure  CRASHSET  is  executed  at  the  site  where  the 
error  is  detected.  In  order  to  make  explicit  the  parallel 
computation  which  exists  in  this  distributed  algorithm  we 
introduced  some  constructs  to  the  language. 

One  of  them  is  the  COBEGIN  statement.  It  is  used  to 
indicate  that  a  message  is  sent  to  several  sites  causing  a 
certain  algorithm  to  be  executed  at  those  sites.  The  syntax 
of  the  COBEGIN  statement  is  given  below. 

COBEGIN FOR j := a STEP b UNTIL c DO

<boolean expression> : A(p1, p2, ..., pn) ;

COEND;

Its semantics is as follows. The variable j (called a
site variable) is incremented according to the FOR clause.
For each value of j the boolean expression is evaluated and
if it is true a message with parameters p1, p2, ..., pn is
sent to site j causing algorithm A to be executed at that
site.


Another construct is the REPLY declaration which may
appear in the procedure heading and it is used to simulate
the fact that some messages require a reply to the sender
site. In particular, assume that site Si sends a message to
site Sj which causes algorithm A to be executed at Sj. If
as a result of this message and of A's execution a message
containing certain parameters must be sent back to Si, these
parameters must be declared as REPLY in the procedure body
of A. Also, a reply variable must be assigned a value only
once during the execution of algorithm A. When algorithm A
returns, the reply message is considered to be sent to the
calling site.
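
The combined effect of COBEGIN and REPLY can be approximated in a modern language as a guarded parallel call that collects one reply per selected site. The Python sketch below is only an analogy under stated assumptions: threads stand in for remote sites, a function return value stands in for the REPLY parameters, and the names `cobegin`, `predicate`, `algorithm` and `params_for` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, Tuple


def cobegin(sites: Iterable[int],
            predicate: Callable[[int], bool],
            algorithm: Callable[..., Any],
            params_for: Callable[[int], Tuple]) -> Dict[int, Any]:
    """Run algorithm(j, *params_for(j)) for every site j with predicate(j) true.

    The calls may run concurrently; the function returns after all of them
    have finished (the COEND), mapping each site to its reply.
    """
    with ThreadPoolExecutor() as pool:
        futures = {j: pool.submit(algorithm, j, *params_for(j))
                   for j in sites if predicate(j)}
        return {j: f.result() for j, f in futures.items()}
```

A call such as `cobegin(range(1, 5), lambda j: j % 2 == 0, work, lambda j: (10,))` runs `work` only for sites 2 and 4 and returns their two replies once both have completed.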


The  procedure  FCRASHSET  is  the  one  executed  to  continue 
the  crash  set  calculation  at  a  foreign  site. 

The  DFS  procedure  differs  from  the  one  given  in  chapter 
4  in  two  aspects. 


a.  When  examining  the  adjacency  list  of  the  node 
currently  being  visited  the  DFS  procedure  marks  as 
output  connections  the  non  local  nodes. 

b.  As  soon  as  we  visit  a  node  such  that  its  label 
contains  an  object  x  then  we  know  right  away  that 
the  object  x  must  be  backed  up  although  we  may 
still  not  know  which  historical  version  should  be 
used to restore it. We must at this point prevent
object x from interacting with other objects since
otherwise the graph could grow in such a way that
some objects which should be in the crash set end
up not being in it. This operation is called
"freezing" the object. Figure B.1 helps to illustrate
the point. If in the graph of figure B.1.a
the interaction edge between nodes 4 and 6 occurred
after node 4 was visited in calculating
G*(2) then we would obtain as a result the graph
in figure B.1.b while the correct result for G*(2)
is the graph in figure B.1.c.

Once the graph G*(X) has been found each of the sites
which contributed to finding G*(X) is requested to find the
subset of the crash set local to it. This is indicated by
sending a message to each of these sites and having them
execute the FCALC procedure. This procedure basically finds
the set of branches incident into nodes of the portion of
G*(X) local to these sites and which are not in G*(X).

Finally,  crash_set(X)  is  simply  the  union  of  all  the 
partial  crash  sets  obtained  at  each  participating  site. 
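
Before the distributed procedures, it may help to see the same computation collapsed onto a single site. The Python sketch below follows the structure of the description above (a depth first search from the last node of each object, the sets S(y,X) of branches entering visited nodes, and a final union over the nodes outside G*(X)), but it omits node numbering, freezing and all message passing. The dictionary representation of the graph and the names `crash_set`, `adjacency`, `into` and `last_nodes` are assumptions for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Node = Hashable
Branch = Tuple[Node, Node]            # (y, v) stands for the branch y -> v


def crash_set(adjacency: Dict[Node, List[Node]],
              into: Dict[Node, List[Branch]],
              last_nodes: Iterable[Node]) -> Tuple[Set[Node], Set[Branch]]:
    """Return (nodes of G*(X), crash set) for a graph held at one site.

    adjacency[v] lists the nodes searched from v; into[v] lists the
    branches incident into v; last_nodes holds lastnode(x) for x in X.
    """
    in_gstar: Set[Node] = set()               # plays the role of nodeset
    s: Dict[Node, Set[Branch]] = {}           # plays the role of S(y, X)
    stack = list(last_nodes)
    while stack:                              # iterative depth first search
        v = stack.pop()
        if v in in_gstar:
            continue
        in_gstar.add(v)
        for branch in into.get(v, []):
            s.setdefault(branch[0], set()).add(branch)
        for w in adjacency.get(v, []):
            if w not in in_gstar:
                stack.append(w)
    result: Set[Branch] = set()
    for y, branches in s.items():
        if y not in in_gstar:                 # branch enters G*(X) from outside
            result |= branches
    return in_gstar, result
```

In a three node example where the search from `a` reaches `b`, and a branch from an unvisited node `c` enters `a`, the crash set consists of that single branch.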


PROCEDURE CRASHSET(X) ;
BEGIN
    COMMENT nodeset[j] = true if node j is in G*(X).
            nsets[k,*] is the buffer for sending and
                       receiving nodeset to site k.
            last[j] = true if node j is lastnode for
                      any object.
            processed[j] = true if node j is numbered &
                           last[j] = true.
            fsite[k] = true if there is an output
                       connection to site k.
            site[j] = 0 if node j is local &
                    = k if node j is local to site k.
            outconnection[j] = true if node j may be used
                               as a root in applying DFS at a
                               foreign site. ;

    INTEGER i, j ;
    INTEGER ARRAY site[1:n] ;
    BOOLEAN ARRAY nodeset[1:n], last[1:n], processed[1:n],
                  fsite[1:m], outconnection[1:n],
                  nsets[1:m, 1:n] ;

    PROCEDURE DFS(v,u) ;
    BEGIN
        number[v] := i := i + 1 ;
        nodeset[v] := true ;

        COMMENT every object found to be in the crash
                set should be frozen. ;
        FOR EACH object z ∈ label(v) DO freeze z ;

        FOR EACH y -> v ∈ into(v)
            DO S(y,X) := S(y,X) U {y -> v} ;

        FOR w in the adjacency list of v DO
            IF site[w] = 0 COMMENT w is local;
            THEN IF w is not numbered
                 THEN DFS(w,v) ;
                 ELSE ;
            ELSE IF nodeset[w] = false
                 THEN BEGIN
                     COMMENT w is an output connection ;
                     outconnection[w] := true ;
                     fsite[site[w]] := true ;
                 END ;
    END dfs ;

    COMMENT initialization phase ;
    i := 0 ;
    processed := nodeset := outconnection := false ;
    fsite := false ;
    crash_set(X) := ∅ ;
    FOR EACH node y in G1 DO S(y,X) := ∅ ;

    COMMENT apply DFS to local graph for each
            object x ∈ X. ;
    FOR EACH object x ∈ X DO
        IF processed[lastnode(x)] = false
        THEN DFS(lastnode(x), 0) ;

    COMMENT continue to apply depth first search at
            foreign sites which have output connections. ;
    FOR j := 1 STEP 1 UNTIL m DO
        nsets[j,*] := nodeset ;

    COBEGIN FOR j := 1 STEP 1 UNTIL m DO
        fsite[j] = true : FCRASHSET(outconnection,
                                    nsets[j,*], j) ;
    COEND ;

    COMMENT nodeset is now obtained by "oring" all the
            nodesets obtained at different sites. ;
    FOR j := 1 STEP 1 UNTIL m DO
        nodeset := nodeset V nsets[j,*] ;

    COMMENT start final computation of crash set(X)
            locally by taking the union of all S(y,X) for
            nodes y ∈ G*(X). ;
    FOR j := 1 STEP 1 UNTIL n DO
        IF nodeset[j] = false
        THEN crash_set(X) := crash_set(X) U S(j,X) ;
        ELSE fsite[site[j]] := true ;

    COMMENT continue final computation of crash set at
            foreign sites ;
    FOR j := 1 STEP 1 UNTIL m DO
        crash_set(j,X) := crash_set(X) ;

    COBEGIN FOR j := 1 STEP 1 UNTIL m DO
        fsite[j] = true : FCALC(crash_set(j,X), nodeset) ;
    COEND ;

    COMMENT complete crash set calculation by taking
            the union of all the partial computations ;
    FOR j := 1 STEP 1 UNTIL m DO
        IF fsite[j] = true
        THEN crash_set(X) := crash_set(X) U crash_set(j,X) ;
END crashset ;


PROCEDURE FCRASHSET(roots, nset, isite) ;
    VALUE roots, isite ; REPLY nset ;
    BOOLEAN ARRAY roots ;
    INTEGER isite ;
BEGIN
    COMMENT nodeset[j] = true if node j is in
                         the local subgraph.
            nsets[k,*] is the buffer for sending and
                       receiving nodeset to site k.
            fsite[k] = true if there is an output
                       connection to site k.
            site[j] = 0 if node j is local &
                    = k if node j is local to site k.
            outconnection[j] = true if node j may be used
                               as a root in applying DFS at a
                               foreign site. ;

    INTEGER i, j ;
    INTEGER ARRAY site[1:n] ;
    BOOLEAN ARRAY nodeset[1:n], fsite[1:m],
                  outconnection[1:n], nsets[1:m, 1:n] ;

    COMMENT initialization phase ;
    i := 0 ; nodeset := nodeset V nset ;
    outconnection := false ; fsite := false ;
    FOR EACH node y in G1 DO S(y,X) := ∅ ;

    COMMENT apply DFS starting at each not yet visited
            root ;
    FOR j := 1 STEP 1 UNTIL n DO
        IF roots[j] = true & site[j] = isite &
           nodeset[j] = false
        THEN DFS(j, 0) ;

    COMMENT continue to apply DFS at foreign sites ;
    FOR j := 1 STEP 1 UNTIL m DO
        nsets[j,*] := nodeset ;

    COBEGIN FOR j := 1 STEP 1 UNTIL m DO
        fsite[j] = true : FCRASHSET(outconnection,
                                    nsets[j,*], j) ;
    COEND ;

    COMMENT nodeset is now obtained by "oring" the nsets
            obtained at different sites ;
    FOR j := 1 STEP 1 UNTIL m DO
        nodeset := nodeset V nsets[j,*] ;

    COMMENT return value of nodeset to calling site ;
    nset := nodeset ;
END fcrashset ;


PROCEDURE FCALC(cset, nodeset) ;
    VALUE nodeset ; REPLY cset ;
    BOOLEAN ARRAY nodeset ;
BEGIN
    INTEGER j ;

    crash_set(X) := cset ;

    FOR j := 1 STEP 1 UNTIL n DO
        IF nodeset[j] = false
        THEN crash_set(X) := crash_set(X) U S(j,X) ;

    COMMENT return value of crash set to calling site ;
    cset := crash_set(X) ;
END fcalc ;